Formal languages let us define the textual representation of data with precision. Formal grammars,
typically in the form of BNF-like productions, describe the language syntax, which is then
annotated for syntax-directed translation and completed with semantic actions. When, apart from 
the textual representation of data, an explicit representation of the corresponding data structure 
is required, the language designer has to devise the mapping between the suitable data model 
and its proper language specification, and then develop the conversion procedure from the parse 
tree to the data model instance. Unfortunately, whenever the format of the textual representation
has to be modified, changes have to be propagated throughout the entire language processor
tool chain. These updates are time-consuming, tedious, and error-prone. Besides, in case differ- 
ent applications use the same language, several copies of the same language specification have 
to be maintained. In this paper, we introduce a model-based parser generator that decouples 
language specification from language processing, hence avoiding many of the problems caused by 
grammar-driven parsers and parser generators. 



I. INTRODUCTION 

A formal language represents a set of strings [29]. Formal
languages consist of an alphabet, which describes
the basic symbol or character set of the language, and a 
grammar, which describes how to write valid sentences 
of the language. In Computer Science, formal languages 
are used, among other things, for the precise definition of 
data formats and the syntax of programming languages. 
The front-end of a language processor, such as an inter- 
preter or compiler, determines the grammatical structure 
corresponding to the textual representation of data con- 
forming to a given language specification. Such gram- 
matical structure is typically represented in the form of 
a parse tree. 

Most existing language specification techniques [4] re- 
quire the language designer to provide a textual specifica- 
tion of the language grammar. The proper specification 
of such a grammar is a nontrivial process that depends 
on the lexical and syntactic analysis techniques to be 
used, since each kind of technique requires the grammar 
to comply with different constraints. Each analysis technique
is characterized by its expressive power, which
determines whether that technique is suitable
for a particular language. The most
significant constraints on formal language specification
originate from the need to consider context-sensitivity,
the need to perform efficient analysis, and some
techniques' inability to resolve conflicts caused by grammar
ambiguities.

In its most general sense, a model is anything used 
in any way to represent something else. In such sense, 
a grammar is a model of the language it defines. In 
Software Engineering, data models are also common. 
Data models explicitly determine the structure of data. 
Roughly speaking, they describe the elements they repre- 
sent and the relationships existing among them. From a 
formal point of view, it should be noted that data models
and grammar-based language specifications are not
equivalent, even though both of them can be used to 
represent data structures. A data model can express re- 
lationships a grammar-based language specification can- 
not. Moreover, a data model does not need to comply 
with the constraints a grammar-based language speci- 
fication has to comply with. Hence describing a data 
model is generally easier than describing the correspond- 
ing grammar-based language specification. 

When both a data model and a grammar-based lan- 
guage processor are required, the implementation of the 
language processor requires the language designer to 
build a grammar-based language specification from the 
data model and also to implement the conversion from 
the parse tree to the data model instance. 

Whenever the language specification has to be mod- 
ified, the language designer has to manually propagate 
changes throughout the entire language processor tool 
chain, from the specification of the grammar defining 
the formal language (and its adaptation to specific pars- 
ing tools) to the corresponding data model. These up- 
dates are time-consuming, tedious, and error-prone. By 
making such changes labor-intensive, the traditional ap- 
proach hampers the maintainability and evolution of the 
language [31].

Moreover, it is not uncommon that different applica- 
tions use the same language. For example, the com- 
piler, different code generators, and other tools within 
an IDE, such as the editor or the debugger, typically 
need to grapple with the full syntax of a programming 
language. Their maintenance typically requires keeping 
several copies of the same language specification in sync. 

As an alternative approach, a language can also be 
defined by a data model that, in conjunction with the 
declarative specification of some constraints, can be automatically
converted into a grammar-based language specification [52].

This way, the data model representing the language
can be modified as needed without having to worry about
the language processor and the peculiarities of the cho- 
sen parsing technique, since the corresponding language 
processor will be automatically updated.

Furthermore, as the data model is the direct repre- 
sentation of a data structure, such data structure can be 
implemented as an abstract data type (in object-oriented 
languages, as a set of collaborating classes). Following 
the proper software design principles, that implementa- 
tion can be performed without having to embed or mix 
semantic actions with the language specification, as is
typically the case with grammar-driven language proces- 
sors. 

Finally, as the data model is not bound to a particular 
parsing technique, evaluating alternative and/or comple- 
mentary parsing techniques is possible without having 
to propagate their constraints into the language model. 
Therefore, by using an annotated data model, model- 
based language specification completely decouples lan- 
guage specification from language processing, which can 
be performed using whichever parsing techniques are
suitable for the formal language implicitly defined by the 
model. 

In this paper, we introduce ModelCC, a model-based 
tool for language specification. As a parser genera- 
tor that decouples language specification from language 
processing, it avoids many of the problems caused by 
grammar-driven parsers and parser generators. 

Section II introduces formal grammars and surveys
parsing algorithms and tools. Section III presents the
philosophy behind model-based language specification.
Section IV comments on ModelCC building blocks. Section V
describes the model constraints supported by
ModelCC, which declaratively define the features of the
formal language defined by the model. Section VI shows
a prototypical example, which is used to discuss the advantages
of model-based language specification over traditional
grammar-based language specification. Lastly,
Section VII presents our conclusions and the future work
that derives from our research.



II. BACKGROUND 

In this section, we introduce formal grammars (Subsection A),
describe the typical architecture of language
processor front-ends (Subsection B), survey key parsing 
algorithms (Subsection C), and review existing lexer and 
parser generators (Subsection D). 



A. Formal Grammars 

Formal grammars are used to specify the syntax of a 
language [22], [24]. A grammar naturally describes the
hierarchical structure of most language constructs.
Using a set of rules, a grammar describes how to form 
strings from the language alphabet that are valid according
to the language syntax. A grammar $G$ is formally
defined as the tuple $(N, \Sigma, P, S)$, where:

• $N$ is the finite set of nonterminal symbols of the language,
sometimes called syntactic variables, none of
which appear in the language strings.

• $\Sigma$ is the finite set of terminal symbols of the language,
also called tokens, which constitute the language
alphabet (i.e. they appear in the language
strings). Therefore, $\Sigma$ is disjoint from $N$.

• $P$ is a finite set of productions, each one of the form
$(\Sigma \cup N)^* N (\Sigma \cup N)^* \rightarrow (\Sigma \cup N)^*$, where $*$ is the
Kleene star operator, $\cup$ denotes set union, the part
before the arrow is called the left-hand side of the
production, and the part after the arrow is called
the right-hand side of the production.

• $S$ is a distinguished nonterminal symbol, $S \in N$:
the grammar start symbol.

For convenience, when several productions share their 
left-hand side, they can be grouped into a single production
containing the shared left-hand side and all
their different right-hand sides separated by |. 

Context-free grammars are formal grammars in which 
the left-hand side of each production consists of only a 
single nonterminal symbol. All their productions, therefore,
are of the form $N \rightarrow (\Sigma \cup N)^*$. Context-free grammars
generate context-free languages.
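
The definitions above can be made concrete with a small data structure. The following is only a minimal sketch in Java under our own assumptions: the class name GrammarSketch, its nested Production class, and the two sample grammars are hypothetical illustrations and do not come from any tool discussed in this paper. It stores the tuple $(N, \Sigma, P, S)$, checks that the terminals are disjoint from the nonterminals and that the start symbol is a nonterminal, and tests whether every production has a single nonterminal as its left-hand side.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GrammarSketch {
    /** One production: both sides are sequences of grammar symbols. */
    static final class Production {
        final List<String> lhs;
        final List<String> rhs;

        Production(List<String> lhs, List<String> rhs) {
            this.lhs = lhs;
            this.rhs = rhs;
        }
    }

    final Set<String> nonterminals;      // N
    final Set<String> terminals;         // Sigma
    final List<Production> productions;  // P
    final String start;                  // S

    GrammarSketch(Set<String> n, Set<String> sigma, List<Production> p, String s) {
        for (String symbol : sigma) {
            if (n.contains(symbol)) {
                throw new IllegalArgumentException("Sigma must be disjoint from N");
            }
        }
        if (!n.contains(s)) {
            throw new IllegalArgumentException("S must belong to N");
        }
        this.nonterminals = n;
        this.terminals = sigma;
        this.productions = p;
        this.start = s;
    }

    /** True when every left-hand side is exactly one nonterminal symbol. */
    boolean isContextFree() {
        for (Production p : productions) {
            if (p.lhs.size() != 1 || !nonterminals.contains(p.lhs.get(0))) {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical sample: E -> E + E | id. */
    static GrammarSketch sums() {
        return new GrammarSketch(
                new HashSet<>(Arrays.asList("E")),
                new HashSet<>(Arrays.asList("+", "id")),
                Arrays.asList(
                        new Production(Arrays.asList("E"), Arrays.asList("E", "+", "E")),
                        new Production(Arrays.asList("E"), Arrays.asList("id"))),
                "E");
    }

    /** Hypothetical sample with a two-symbol left-hand side: S -> a B, a B -> a b. */
    static GrammarSketch withContext() {
        return new GrammarSketch(
                new HashSet<>(Arrays.asList("S", "B")),
                new HashSet<>(Arrays.asList("a", "b")),
                Arrays.asList(
                        new Production(Arrays.asList("S"), Arrays.asList("a", "B")),
                        new Production(Arrays.asList("a", "B"), Arrays.asList("a", "b"))),
                "S");
    }

    public static void main(String[] args) {
        System.out.println(sums().isContextFree());
        System.out.println(withContext().isContextFree());
    }
}
```

Grouping productions that share a left-hand side, as in E -> E + E | id, is purely notational: the sketch stores them as two separate productions.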

A context-free grammar is said to be ambiguous if 
there exists at least one string that can be generated by the
grammar in more than one way. In fact, some context- 
free languages are inherently ambiguous (i.e. all context- 
free grammars generating them are ambiguous). 
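
As a brief illustration (our own example, chosen for its simplicity), consider the grammar with the single nonterminal $E$ and the productions $E \rightarrow E + E \mid \mathrm{id}$. The string $\mathrm{id} + \mathrm{id} + \mathrm{id}$ admits two different leftmost derivations, so this grammar is ambiguous:

\[
\begin{aligned}
E &\Rightarrow E + E \Rightarrow E + E + E \Rightarrow \mathrm{id} + E + E \Rightarrow \mathrm{id} + \mathrm{id} + E \Rightarrow \mathrm{id} + \mathrm{id} + \mathrm{id} \\
E &\Rightarrow E + E \Rightarrow \mathrm{id} + E \Rightarrow \mathrm{id} + E + E \Rightarrow \mathrm{id} + \mathrm{id} + E \Rightarrow \mathrm{id} + \mathrm{id} + \mathrm{id}
\end{aligned}
\]

The first derivation groups the string as $(\mathrm{id} + \mathrm{id}) + \mathrm{id}$, whereas the second groups it as $\mathrm{id} + (\mathrm{id} + \mathrm{id})$. The language itself is not inherently ambiguous, since an unambiguous grammar such as $E \rightarrow E + \mathrm{id} \mid \mathrm{id}$ generates the same strings.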

An attribute grammar is a formal way to define at- 
tributes for the symbols in the productions of a formal 
grammar, associating values to these attributes. Semantic
rules annotate attribute grammars and define the value
of an attribute in terms of the values of other attributes
and constants. The key feature of attribute grammars
is that they let us transfer information from anywhere
in the abstract syntax tree to anywhere else in a
controlled and formal way, hence their frequent use in 
syntax-directed translation. 

Graph grammars jl 31 ] allow the manipulation of graphs 
based on productions: if the left-hand side of a produc- 
tion matches the working graph or a subgraph of it, it 
can be replaced with the right-hand side of the produc- 
tion. These grammars can be used to define the syntax 
of visual languages. 

B. The Architecture of Language Processors 

The architecture of a language-processing system de- 
composes language processing into several steps which 
are typically grouped into two phases: analysis and synthesis.
The analysis phase, which is the responsibility of the
language processor front end, starts by breaking up its
input into its constituent pieces (lexical analysis or scan- 
ning) and imposing a grammatical structure upon them 
(syntax analysis or parsing). The language processor 
back end will later synthesize the desired target from the 
results provided by the front end. 

A lexical analyzer, also called lexer or scanner, pro- 
cesses an input string conforming to a language specifi- 
cation and produces the tokens found within it. Lexical 
ambiguities occur when a given input string simultaneously
corresponds to several token sequences [43], which
may overlap. 

A syntactic analyzer, also called parser, processes in- 
put tokens and determines their grammatical structure 
with respect to the given language grammar, usually in 
the form of parse trees. In the absence of lexical ambi- 
guities, the parser input consists of a stream of tokens, 
whereas it will be a directed acyclic graph of tokens when 
lexical ambiguities are present. Syntactic ambiguities 
occur when a given set of tokens simultaneously corresponds
to several parse trees.

C. Scanning and Parsing Algorithms 

Scanning and parsing algorithms are characterized by
the expressive power of the languages they can be applied
to, their support for ambiguities or lack thereof, and the
constraints they impose on language specifications.

Traditional lexers are based on a finite-state machine 
that is built from a set of regular expressions [39], each
of which describes a token type. The efficiency of regular
expression lexers is $O(n)$, where $n$ is the input string length.
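
The following minimal sketch illustrates the idea of describing each token type by a regular expression. It is written under our own assumptions: the class name RegexLexerSketch and the three token types are hypothetical, and the sketch tries each java.util.regex pattern in turn at the current position instead of compiling all expressions into a single finite-state machine, as a generated lexer would. The backtracking engine of java.util.regex does not guarantee linear time in general, so the sketch shows the specification style rather than the efficiency bound.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLexerSketch {
    // Hypothetical token types, each described by one regular expression.
    static final Map<String, Pattern> TOKEN_TYPES = new LinkedHashMap<>();
    static {
        TOKEN_TYPES.put("NUMBER", Pattern.compile("[0-9]+"));
        TOKEN_TYPES.put("IDENT", Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"));
        TOKEN_TYPES.put("OP", Pattern.compile("[-+*/=()]"));
    }

    /** Splits the input into "TYPE:text" tokens, preferring the longest match. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> entry : TOKEN_TYPES.entrySet()) {
                Matcher m = entry.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = entry.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("Unexpected character at position " + pos);
            }
            tokens.add(bestType + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y1"));
    }
}
```

Because ties are broken in favor of the first token type declared, the declaration order acts as a simple precedence rule; this lexer always commits to a single token sequence and therefore does not support lexical ambiguities.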

Lamb [51] is a lexer with lexical ambiguity support
that allows the specification of tokens not only by regular
expressions, but also by arbitrary pattern matching
classes. Lamb also supports token type precedences. The
efficiency of the Lamb lexer is, in the worst case, $O(n^2 t^2)$,
where $n$ is the input string length and $t$ is the number of different
token types.

Efficient parsers for specific subsets of context-free 
grammars exist. These include top-down LL parsers, 
which construct a leftmost derivation of the input sen- 
tence, and bottom-up LR parsers, which construct a 
rightmost derivation of the input sentence in reverse.

LL grammars were formally introduced in [37], although
LL(k) parsers predate their name [46]. An LL parser is
called an LL(k) parser if it uses k tokens of lookahead 
when parsing a sentence, while it is an LL(*) parser if it 
is not restricted to a finite k tokens of lookahead and it 
can make parsing decisions by recognizing whether the 
following tokens belong to a regular language [13, EH- 
While LL(k) parsers always run in linear time, LL(*) parsers range from
$O(n)$ to $O(n^2)$.

LR parsers were introduced by Knuth [33]. DeRemer
later developed the LALR [1, El and SLR parsers 
that are in use today. When parsing theory was originally 
developed, machine resources were scarce, and so parser 
efficiency was the paramount concern [47]. Hence all the
aforementioned parsing algorithms parse in linear time
(i.e. their efficiency is $O(n)$, where $n$ is the input string
length) and they do not support syntactic ambiguities.

Efficient LR and LL parsers for certain classes of am- 
biguous grammars are also possible by using simple dis- 
ambiguating rules 0, [HI . 

A general-purpose dynamic programming algorithm
for parsing context-free grammars was independently developed
by Cocke, Younger [57], and Kasami [30]: the
CYK parser. This general-purpose algorithm is $O(n^3)$
for ambiguous and unambiguous context-free grammars.
The Earley parser [pp ] is another general-purpose dy- 
namic programming algorithm for parsing context-free 
grammars that executes in cubic time ($O(n^3)$) in the
general case, quadratic time ($O(n^2)$) for unambiguous
grammars, and linear time ($O(n)$) for almost all LR(k)
grammars.
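
To make the cubic bound of dynamic programming parsers tangible, the following minimal sketch implements a CYK recognizer. It rests on our own assumptions: the class name CykSketch is hypothetical, and the hard-coded grammar (S -> A B | A T, T -> S B, A -> a, B -> b, in Chomsky normal form) is an example of ours that generates the strings a^n b^n for n >= 1. The three nested loops over substring length, start position, and split point account for the $O(n^3)$ behavior.

```java
class CykSketch {
    // Nonterminal indices for the hypothetical sample grammar.
    static final int S = 0, A = 1, B = 2, T = 3;

    // Binary rules X -> Y Z, encoded as {X, Y, Z}.
    static final int[][] BINARY = { {S, A, B}, {S, A, T}, {T, S, B} };

    /** Returns the nonterminal deriving a single character, or -1 if none does. */
    static int terminalRule(char c) {
        if (c == 'a') return A;
        if (c == 'b') return B;
        return -1;
    }

    /** CYK recognition: true when the start symbol derives the whole word. */
    static boolean recognizes(String w) {
        int n = w.length();
        if (n == 0) {
            return false; // the sample grammar has no epsilon rule
        }
        // derives[i][len][X]: X derives the substring of length len starting at i.
        boolean[][][] derives = new boolean[n][n + 1][4];
        for (int i = 0; i < n; i++) {
            int x = terminalRule(w.charAt(i));
            if (x >= 0) {
                derives[i][1][x] = true;
            }
        }
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len <= n; i++) {
                for (int split = 1; split < len; split++) {
                    for (int[] rule : BINARY) {
                        if (derives[i][split][rule[1]]
                                && derives[i + split][len - split][rule[2]]) {
                            derives[i][len][rule[0]] = true;
                        }
                    }
                }
            }
        }
        return derives[0][n][S];
    }

    public static void main(String[] args) {
        System.out.println(recognizes("aabb"));
        System.out.println(recognizes("abab"));
    }
}
```

The table is filled in the same way whether or not the grammar is ambiguous, which is why the running time of this algorithm does not depend on ambiguity.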

Free from the requirement to develop efficient linear-time
parsing algorithms, researchers have developed
many powerful nondeterministic parsing strategies following
both the top-down approach (LL parsers) and the
bottom-up approach (LR parsers).

Following the top-down approach, Packrat parsers 
[15] and their associated Parsing Expression Grammars
(PEGs) [16] preclude only the use of left-recursive grammar
rules and force the rules to be ordered, that is, when
an alternative succeeds, the following alternatives are 
ignored. Even though they use backtracking, packrat 
parsers are linear rather than exponential because of the 
rule order and because they memoize partial results. 
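
The two ingredients mentioned above, ordered alternatives and memoization, can be sketched as follows. This is a hand-written illustration under our own assumptions, not the output of any packrat parser generator: the class name PackratSketch and the sample rules Sum <- Number '+' Sum / Number and Number <- [0-9]+ are hypothetical. Each rule returns the position reached after a successful match, or -1 on failure, and caches its result per input position so that backtracking never evaluates the same rule twice at the same position.

```java
import java.util.HashMap;
import java.util.Map;

class PackratSketch {
    private final String input;
    private final Map<String, Integer> memo = new HashMap<>();

    PackratSketch(String input) {
        this.input = input;
    }

    // Sum <- Number '+' Sum / Number
    int sum(int pos) {
        String key = "Sum@" + pos;
        Integer cached = memo.get(key);
        if (cached != null) {
            return cached;
        }
        int result = -1;
        int afterNumber = number(pos);
        if (afterNumber >= 0 && afterNumber < input.length()
                && input.charAt(afterNumber) == '+') {
            result = sum(afterNumber + 1); // first alternative
        }
        if (result < 0) {
            result = number(pos); // second alternative, served from the memo table
        }
        memo.put(key, result);
        return result;
    }

    // Number <- [0-9]+
    int number(int pos) {
        String key = "Number@" + pos;
        Integer cached = memo.get(key);
        if (cached != null) {
            return cached;
        }
        int end = pos;
        while (end < input.length() && Character.isDigit(input.charAt(end))) {
            end++;
        }
        int result = end > pos ? end : -1;
        memo.put(key, result);
        return result;
    }

    /** True when Sum consumes the whole input. */
    static boolean matches(String s) {
        return new PackratSketch(s).sum(0) == s.length();
    }

    // Choice <- 'a' / 'ab': once 'a' succeeds, 'ab' is never tried.
    static int orderedChoice(String s) {
        if (s.startsWith("a")) return 1;
        if (s.startsWith("ab")) return 2;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(matches("1+22+3"));
        System.out.println(orderedChoice("ab"));
    }
}
```

The orderedChoice method shows the consequence of rule ordering: on the input "ab" the first alternative consumes only "a", so the second alternative is ignored even though it would have matched the whole input.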

Following the bottom-up approach, Generalized LR 
(GLR) parsers parse in linear to cubic time, depending on 
how closely the grammar conforms to the underlying LR 
strategy. The time required to run the algorithm is pro- 
portional to the degree of nondeterminism in the gram- 
mar. Bernard Lang is typically credited with the original 
GLR idea [34]. Later, Tomita used the algorithm for natural
language processing [55]. Tomita's Universal parser
[56], however, failed for grammars with epsilon rules (i.e.
productions with an empty right-hand side). Several extensions
have been proposed that support epsilon rules.

GLR parsing is an optimization of Earley's algorithm 
and, like parsing expression grammars, tends to be scannerless.
Being scannerless is necessary if a tool needs
to support grammar composition (i.e. to easily inte- 
grate one language into another or create new grammars 
by modifying and composing pieces from existing gram- 
mars), unless lexical ambiguities are supported. The 
Fence parser [50] is an optimized Earley-like algorithm
that works on top of the Lamb lexer [51], which provides
support for lexical ambiguities. Hence, Fence is also suit- 
able for grammar composition and as efficient as GLR 
parsers in practice. 

D. Lexer and Parser Generators 

Lexer and parser generators are tools that take a lan- 
guage specification as input and produce a lexer or parser 
as output. They can be characterized by their input syntax,
their ability to specify semantic actions, and the
parsing algorithms the resulting parsers implement. 

Lex [35] and yacc [28] are commonly used in conjunction
[36]. They are the default lexer generator and parser
generator in many Unix environments and standard compiler
textbooks use them as examples. Lex is
the prototypical regular-expression-based lexer genera- 
tor, while yacc and its derivatives generate LALR parsers. 

JavaCC [38] is a parser generator that creates LL(k)
parsers, although it has been superseded by ANTLR [48].
ANTLR is a scannerless parser generator that creates
LL(*) parsers. ANTLR-generated parsers are linear in
practice and greatly reduce speculation, reducing the
memoization overhead of pure packrat parsers.

The Rats! [23] packrat parser generator is a PEG-based
tool that also optimizes memoization to improve its
speed and reduce its memory footprint. Like ANTLR, it 
does not accept left-recursive grammars. Unlike ANTLR, 
programmers do not have to deal with conflict messages, 
since PEGs have no concept of a grammar conflict: they 
always choose the first possible interpretation, which can 
lead to unexpected behavior. 

NLyacc [26] and Elkhound [40] are examples of GLR
parser generators. Elkhound achieves yacc-like parsing
speeds when grammars are LALR(1) but suffers from the
practical limitations of GLR parsers. Like PEG parsers,
GLR parsers do not always do what is intended, in part
because they accept ambiguous grammars and programmers
have to detect ambiguities dynamically [47].

YAJco [49] is an interesting tool that accepts, as input,
a set of Java classes with annotations that specify
the prefixes, suffixes, operators, tokens, parentheses, and 
optional elements common in typical programming languages.
As output, YAJco generates a BNF-like grammar
specification for JavaCC [38]. Since YAJco is built
on top of a parser generator, the language designer has 
to be careful when annotating his classes, as the im- 
plicit grammar he is defining has to comply with the 
constraints imposed by the underlying LL(k) parser gen- 
erator. 

To the best of our knowledge, no existing parser gen- 
erator follows the approach we now proceed to describe. 



III. MODEL-BASED LANGUAGE SPECIFICATION 

In this section, we discuss the concepts of abstract 
and concrete syntax, analyze the potential advantages 
of model-based language specification, and compare our 
proposed approach with the traditional grammar-driven 
language design process. 



A. Abstract Syntax and Concrete Syntaxes 

The abstract syntax of a language is just a representation
of the structure of the different elements of a language
without the superfluous details related to its particular
textual representation [32]. On the other hand,
a concrete syntax is a particularization of the abstract
syntax that defines, with precision, a specific textual or
graphical representation of the language. It should also
be noted that a single abstract syntax can be shared by
several concrete syntaxes.

For example, the abstract syntax of the typical <if>-<then>-<optional else>
statement in imperative programming
languages could be described as the concatenation
of a conditional expression and one or two statements.
Different concrete syntaxes could be defined for
such an abstract syntax, which would correspond to dif- 
ferent textual representations of a conditional statement, 
e.g. {"if", "(", expression, ")", statement, optional "else" 
followed by another statement} and {"IF", expression, 
"THEN", statement, optional "ELSE" followed by an- 
other statement, "ENDIF"}. 
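
The distinction can be sketched in Java as follows. This is only an illustration under our own assumptions: the class name IfSyntaxSketch and its methods are hypothetical, and expressions and statements are reduced to plain strings for brevity. A single IfStatement class plays the role of the abstract syntax, while two separate methods render it in the two concrete syntaxes mentioned above.

```java
class IfSyntaxSketch {
    /** Abstract syntax: a condition plus one or two statements, with no keywords. */
    static final class IfStatement {
        final String condition;
        final String thenBranch;
        final String elseBranch; // null when the optional else part is absent

        IfStatement(String condition, String thenBranch, String elseBranch) {
            this.condition = condition;
            this.thenBranch = thenBranch;
            this.elseBranch = elseBranch;
        }
    }

    /** Concrete syntax 1: "if" "(" expression ")" statement [ "else" statement ]. */
    static String toParenthesizedStyle(IfStatement s) {
        String text = "if ( " + s.condition + " ) " + s.thenBranch;
        if (s.elseBranch != null) {
            text += " else " + s.elseBranch;
        }
        return text;
    }

    /** Concrete syntax 2: "IF" expression "THEN" statement [ "ELSE" statement ] "ENDIF". */
    static String toKeywordStyle(IfStatement s) {
        String text = "IF " + s.condition + " THEN " + s.thenBranch;
        if (s.elseBranch != null) {
            text += " ELSE " + s.elseBranch;
        }
        return text + " ENDIF";
    }

    public static void main(String[] args) {
        IfStatement stmt = new IfStatement("x > 0", "y = 1", "y = 2");
        System.out.println(toParenthesizedStyle(stmt));
        System.out.println(toKeywordStyle(stmt));
    }
}
```

Both renderings are produced from the same object, so adding a third textual representation would not require any change to the IfStatement class itself.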

The idea behind model-based language specification is
that, starting from a single abstract syntax model (ASM) 
representing the core concepts in a language, language 
designers would later develop one or several concrete syn- 
tax models (CSMs). These concrete syntax models would 
suit the specific needs of the desired textual or graphi- 
cal representation. The ASM-CSM mapping could be 
performed, for instance, by annotating the abstract syn- 
tax model with the constraints needed to transform the 
elements in the abstract syntax into their concrete rep- 
resentation. 



B. Advantages of Model-Based Language Specification 

Focusing on the abstract syntax of a language offers 
some benefits [H| and provides some potential advan- 
tages to model-based language specification over the tra- 
ditional grammar-based language specification approach: 

• When reasoning about the features a language 
should include, specifying its abstract syntax seems 
to be a better starting point than working on its 
concrete syntax details. In fact, we control complexity
by building abstractions that hide details
when appropriate.

• Sometimes, different incarnations of the same ab- 
stract syntax might be better suited for different 
purposes (e.g. a human-friendly syntax for manual
coding, a machine-oriented format for automatic
code generation, a Fit-like [42] syntax for
testing, different architectural views for discussions 
with project stakeholders...). Therefore, it might 
be useful for a given language to support multiple 
syntaxes. 

• Since model-based language specification is independent
from specific lexical and syntactic analysis
techniques, the constraints imposed by specific
parsing algorithms do not affect the language design
process. In principle, it might not even be necessary
for the language designer to have advanced
knowledge of parser generators when following a
model-driven language specification approach.

Figure 1 Traditional language processing approach.

Figure 2 Model-based language processing approach.

• A full-blown model-driven language workbench [18]
would allow the modification of a language ab- 
stract syntax model and the automatic generation 
of a working IDE on the run. The specification 
of domain-specific languages would become easier, 
as the language designer could play with the lan- 
guage specification and obtain a fully-functioning 
language processor on the fly, without having to 
worry about the propagation of changes through- 
out the complete language processor tool chain. 

In short, the model-driven language specification approach
brings domain-driven design [14] to the domain of
language design. It provides the necessary infrastructure 
for what Evans would call the 'supple design' of language 
processing tools: the intention-revealing specification of 
languages by means of abstract syntax models, the sep- 
aration of concerns in the design of language processing 
tools by means of declarative ASM-CSM mappings, and 
the automation of a significant part of the language pro- 
cessor implementation. 



C. Comparison with the Traditional Approach 

A diagram summarizing the traditional language design
process is shown in Figure 1, whereas the corresponding
diagram for the model-based approach proposed
in this paper is shown in Figure 2.

When following the traditional grammar-driven ap- 
proach, the language designer starts by designing the 
grammar corresponding to the concrete syntax of the 
desired language, typically in BNF or a similar for- 
mat. Then, the designer annotates the grammar with 
attributes and, probably, semantic actions, so that the 
resulting attribute grammar can be fed into lexer and 
parser generator tools that produce the corresponding 
lexer and parser, respectively. The resulting syntax-directed
translation process generates abstract syntax
trees from the textual representation in the concrete syntax
of the language.

When following the model-driven approach, the lan- 
guage designer starts by designing the conceptual model 
that represents the abstract syntax of the desired lan- 
guage, focusing on the elements the language will rep- 
resent and their relationships. Instead of dealing with 
the syntactic details of the language from the start, the 
designer devises a conceptual model for it (i.e. the abstract
syntax model, or ASM), the same way a database
designer starts with an implementation-independent con- 
ceptual database schema before he converts that schema 
into a logical schema that can be implemented in the par- 
ticular kind of DBMS that will host the final database. 
In the model-driven language design process, the ASM 
would play the role of entity-relationship diagrams in 
database design and each particular CSM would corre- 
spond to the final table layout of the physical database 
schema in a relational DBMS. 

Even though the abstract syntax model of the language 
could be converted into a suitable concrete syntax model 
automatically, the language designer will often be inter- 
ested in specifying the details of the ASM-CSM mapping. 
With the help of constraints imposed over the abstract 
model, the designer will be able to guide the conversion 
from the ASM to its concrete representation using a par- 
ticular CSM. This concrete model, when it corresponds 
to a textual representation of the abstract model, will 
be described by a formal grammar. It should be noted, 
however, that the specification of the ASM is independent 
from the peculiarities of the desired CSM, as a database 
designer does not consider foreign keys when designing 
the conceptual schema of a database. Therefore, the 
grammar specification constraints enforced by particu- 
lar parsing tools will not impose limits on the design of 
the ASM. The model-driven language processing tool will 
take charge of that and, ideally, it will employ the most 
efficient parsing technique that works for the language 
resulting from the ASM-CSM mapping. 

While the traditional language designer specifies the 
grammar for the concrete syntax of the language, annotates
it for syntax-directed processing, and obtains an abstract
syntax tree that is an instance of the implicit conceptual
model defined by the grammar, the model-based
language designer starts with an explicit full-fledged con- 
ceptual model and specifies the necessary constraints for 
the ASM-CSM mapping. In both cases, parser generators 
create the tools that parse the input text in its concrete 
syntax. The difference lies in the specification of the 
grammar that drives the parsing process, which is handcrafted
in the traditional approach and automatically
generated as a result of the ASM-CSM mapping in the
model-driven process. 

Another difference stems from the fact that the result
of the parsing process is an instance of an implicit model
in the grammar-driven approach, while that model is explicit
in the model-driven approach. An explicit conceptual
model is absent in the traditional language design
process, although that does not mean that it does not exist.
On the other hand, the model-driven approach enforces 
the existence of an explicit conceptual model, which lets 
the proposed approach reap the benefits of domain-driven 
design. 

There is a third difference between the grammar-driven 
and the model-driven approaches to language specifica- 
tion. While, in general, the result of the parsing process 
is an abstract syntax tree that corresponds to a valid 
parsing of the input text according to the language con- 
crete syntax, nothing prevents the conceptual model de- 
signer from modeling non-tree structures. Hence the use 
of the 'abstract syntax graph' term in Figure 2. This
might be useful, for instance, for modeling graphical lan- 
guages, which are not constrained by the linear nature of 
the traditional syntax-driven specification of text-based 
languages. 

Instead of going from a concrete syntax model to an 
implicit abstract syntax model, as is typically done,
the model-based language specification process goes from 
the abstract to the concrete. This alternative approach 
facilitates the proper design and implementation of lan- 
guage processing systems by decoupling language pro- 
cessing from language specification, which is now per- 
formed by imposing declarative constraints on the ASM- 
CSM mapping. 



IV. MODELCC MODEL SPECIFICATION 

Having described model-driven language specification
in general terms, we now proceed to introduce
ModelCC, a tool that supports our proposed approach to 
the design of language processing systems. ModelCC, at 
its core, acts as a parser generator. Its starting abstract 
syntax model is created by defining classes that represent 
language elements and establishing relationships among 
those elements (associations in UML terms). Once the 
abstract syntax model is established, its incarnation as 
a concrete syntax is guided by the constraints imposed 
over language elements and their relationships as anno- 
tations on the abstract syntax model. In other words, 
the declarative specification of constraints over the ASM
establishes the desired ASM-CSM mapping.

In this section, we introduce the basic constructs that 
allow the specification of abstract syntax models, while 
we will discuss how model constraints help us establish a 
particular ASM-CSM mapping in the following section 
of this paper. Basically, the ASM is built on top of 
basic language elements, which might be viewed as the 
tokens in the model-driven specification of a language. 
ModelCC provides the necessary mechanisms to combine 
those basic elements into more complex language con- 
structs, which correspond to the use of concatenation, 
selection, and repetition in the syntax-driven specifica- 
tion of languages. 

Our final goal is to allow the specification of languages 
in the form of abstract syntax models such as the one 
shown in Figure 1291 which will be used as an example 
in Section VI. This model, in UML format, specifies
the abstract syntax model of the language supported by 
a simple arithmetic calculator. The annotations that ac- 
company the model provide the necessary information for 
establishing the complete ASM-CSM mapping that cor- 
responds to the traditional infix notation for arithmetic 
expressions. Moreover, the model also incorporates the 
method that lets us evaluate such arithmetic expressions. 
Therefore, Figure [55] represents a complete interpreter for 
arithmetic expressions in infix notation using ModelCC 
(its complete implementation as a set of cooperating Java 
classes appears in Figure [30]).

As mentioned above, the specification of the ASM in 
ModelCC starts with the definition of basic language el- 
ements, which can be modeled as simple classes in an 
object-oriented programming language. The ASM-CSM 
mapping of those basic elements will establish their corre- 
spondence to the tokens that appear in the concrete syn- 
tax of the language whose ASM we design in ModelCC. 
In the following subsections, we describe the mechanisms 
provided by ModelCC to implement the three main con- 
structs that let us specify complete abstract syntax mod- 
els on top of basic language elements. 



A. Concatenation 

Concatenation is the most basic construct we can use 
to combine sets of language elements into more complex 
language elements. In textual languages, this is achieved 
just by joining the strings representing its constituent 
language elements into a longer string that represents 
the composite language element. 

In ModelCC, concatenation is achieved by object com- 
position. The resulting language element is the compos- 
ite element and its members are the language elements 
the composite element collates. 

When translating the ASM into a textual CSM, each
composite element in a ModelCC model generates a
production rule in the grammar representing the CSM. This
production, with the nonterminal symbol of the composite
element in its left-hand side, concatenates the nonterminal
symbols corresponding to the constituent elements of the
composite element in its right-hand side. By default, the
order of the constituent elements in the production rule is
given by the order in which they are specified in the object
composition, but such an order is not strictly necessary
(e.g. many ambiguous languages might require unordered
sequences of constituent elements and even some
unambiguous languages allow for such flexibility).

[UML class diagram: AssignmentStatement, composed of
id : Identifier and exp : Expression]

Figure 3 An assignment statement as an example of element
composition (concatenation in textual CSM terms).

The model in Figure [3] shows an example of object
composition in ASM terms that corresponds to string
concatenation in CSM terms. In this example, an
assignment statement is composed of an identifier, i.e. a
reference to its l-value, and an expression, which provides
its r-value. In a textual CSM, the composite
AssignmentStatement element would be translated into the
following production rule: <AssignmentStatement> ::=
<Identifier> <Expression>. Obviously, such a production
would probably include some syntactic sugar in an actual
programming language, either for avoiding potential
ambiguities or just for improving its readability and
writability, but that is the responsibility of ASM-CSM
mappings, which will be analyzed in Section V.
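
The translation from object composition to a production rule can be illustrated with a short Python sketch. It is not ModelCC code: the dataclasses are stand-ins for ASM classes and the production_for helper is a hypothetical name introduced only for this example.

```python
from dataclasses import dataclass, fields


# Stand-ins for basic language elements of the ASM (assumed, left empty).
@dataclass
class Identifier:
    pass


@dataclass
class Expression:
    pass


# Composite element: its members are the constituent elements.
@dataclass
class AssignmentStatement:
    id: Identifier     # reference to the l-value
    exp: Expression    # provides the r-value


def production_for(composite):
    """Return the BNF production for a composite element.

    The members, taken in declaration order, become the
    right-hand side of the production.
    """
    rhs = " ".join(f"<{f.type.__name__}>" for f in fields(composite))
    return f"<{composite.__name__}> ::= {rhs}"


print(production_for(AssignmentStatement))
```

The sketch relies on declaration order, which mirrors the default ordering described above.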



B. Selection 

Selection is necessary as a language modeling primitive 
operation to represent choices, so that we can specify 
alternative elements in language constructs. 

In ModelCC, selection is achieved by subtyping.
Specifying inheritance relationships among language elements
in an object-oriented context corresponds to defining
'is-a' relationships in a more traditional database design
setting. The language element we wish to establish
alternatives for is the superelement (i.e. the superclass in OO
or the supertype in DB modeling), whereas the different
alternatives are represented as subelements (i.e. subclasses
in OO, subtypes in DB modeling). Alternative elements
are always kept separate to enhance the modularity of
ModelCC abstract syntax models and their integration
in language processing systems.

In the current version of ModelCC, multiple inheritance
is not supported, although the same results can be easily
simulated by combining inheritance and composition. We
can define subelements for the different inheritance
hierarchies representing choices so that each of those
subelements is composed of the single element that appears
as a common choice in the different scenarios. This
solution fits well with most existing programming
languages, which do not always support multiple
inheritance, and avoids the pollution of the shared element
interface in the ASM, which would appear as a side effect
of allowing multiple inheritance in abstract syntax models.

[UML class diagram: Expression, with subclasses
UnaryExpression, BinaryExpression, and
ParenthesizedExpression]

Figure 4 Subtyping for representing choices in ModelCC.

Each inheritance relationship in ModelCC, when con- 
verting the ASM into a textual CSM, generates a pro- 
duction rule in the CSM grammar. In those productions, 
the nonterminal symbol corresponding to the superele- 
ment appears in its left-hand side, while the nonterminal 
symbol of the subelement appears as the only symbol 
in the production right-hand side. Obviously, if a given 
superelement has k different subelements, k different pro- 
ductions will be generated representing the k alternatives 
defined by the abstract syntax model. 

For example, the model shown in Figure [4] illustrates
how an arithmetic Expression can be a UnaryExpression,
a BinaryExpression, or a ParenthesizedExpression in the
language defined for a simple calculator. The grammar
resulting from the conversion of this ASM into a textual
CSM would be: <Expression> ::= <UnaryExpression> |
<BinaryExpression> | <ParenthesizedExpression>.

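As an illustration of how subtyping yields one production per alternative, consider the following Python sketch. The class hierarchy mirrors the calculator example; the selection_productions helper is a hypothetical name and does not belong to ModelCC.

```python
class Expression:
    """Superelement: the element we define alternatives for."""


class UnaryExpression(Expression):
    pass


class BinaryExpression(Expression):
    pass


class ParenthesizedExpression(Expression):
    pass


def selection_productions(superelement):
    """Return one production per direct subelement.

    A superelement with k subelements yields k productions
    of the form <Super> ::= <Sub>.
    """
    return [
        f"<{superelement.__name__}> ::= <{sub.__name__}>"
        for sub in superelement.__subclasses__()
    ]


for production in selection_productions(Expression):
    print(production)
```

The three productions obtained are equivalent to the single rule with three alternatives given above.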


C. Repetition 

Representing repetition is also necessary in abstract 
syntax models, since a language element might appear 
several times in a given language construct. When a vari- 
able number of repetitions is allowed, mere concatenation 
does not suffice. 

Repetition is also achieved through object composition
in ModelCC, just by allowing different multiplicities in
the associations that connect composite elements to their
constituent elements. The cardinality constraints
described in Section V.C can be used to annotate ModelCC
models in order to establish specific multiplicities for
repeatable language elements.

Each composition relationship representing a repetitive
structure in the ASM will lead to two additional
production rules in the grammar defining a textual
CSM: a recursive production of the form <ElementList>
::= <Element> <ElementList> and a complementary
production <ElementList> ::= <Element>, where
<Element> is the nonterminal symbol associated with the
repeating element. These productions are right-recursive;
an equivalent left-recursive formulation can also be
obtained when needed.

[UML class diagram: OutputStatement, composed of
exps : Expression]

Figure 5 Multiple composition for representing repetition in
ModelCC.

It should also be noted that <ElementList> will take
the place of the nonterminal <Element> in the produc- 
tion derived from the composition relationship that con- 
nects the repeating element with its composite element 
(see the above section on how composition is employed 
to represent concatenation in ModelCC). 

In practice, repeating elements will often appear
separated in the concrete syntax of a textual language, hence
repeatable elements can be annotated with separators,
as we will see in Section V.B. In case separators are
employed, the recursive production derived from
repeatable elements will be of the form <ElementList> ::=
<Element> <Separator> <ElementList>.

When a repeatable language element is optional, i.e.
its multiplicity can be 0, an additional epsilon production
has to be appended to the grammar defining the textual
CSM derived from the ASM: <ElementList> ::= ε.

For example, the model in Figure [5] shows that an Out- 
putStatement can include several Expressions, which will 
be evaluated for their results to be sent to the corre- 
sponding output stream. This ASM would result in the 
following textual CSM grammar: <OutputStatement> 
::= <ExpressionList> for describing the composition and 
<ExpressionList> ::= <Expression> <ExpressionList> 
| <Expression> for allowing repetition. 

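The productions derived from a repetition can be summarized in a few lines of Python. This is only a sketch under the naming convention used in the text (the list nonterminal is the element name followed by "List"); repetition_productions is a hypothetical helper, not a ModelCC function.

```python
def repetition_productions(element, separator=None, optional=False):
    """Return the productions generated for a repeatable element.

    element   -- name of the repeating element, e.g. "Expression"
    separator -- literal placed between consecutive instances, if any
    optional  -- True when the multiplicity can be 0
    """
    lst = f"<{element}List>"
    elem = f"<{element}>"
    sep = f' "{separator}"' if separator else ""
    productions = [
        f"{lst} ::= {elem}{sep} {lst}",   # recursive production
        f"{lst} ::= {elem}",              # complementary production
    ]
    if optional:
        productions.append(f"{lst} ::= ε")  # epsilon production
    return productions


for production in repetition_productions("Expression"):
    print(production)
```

Calling the helper with separator="," or optional=True reproduces the separator and epsilon variants described in this subsection.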


V. MODELCC MODEL CONSTRAINTS 

Once we have examined the mechanisms that let us 
create abstract syntax models in ModelCC, we now pro- 
ceed to describe how constraints can be imposed on such 
models in order to establish the desired ASM-CSM map- 
ping. As soon as that ASM-CSM mapping is established, 
ModelCC is able to generate the suitable parser for the 
concrete syntax defined by the CSM. 

In ModelCC, the constraints imposed over abstract
syntax models to define a particular ASM-CSM mapping
are declared as metadata annotations on the model itself.
Now supported by all the major programming platforms,
metadata annotations have been used in reflective
programming and code generation. Among many other
things, they can be employed for dynamically extending
the features of your software development runtime
or even for building complete model-driven software
development tools that benefit from the infrastructure
provided by your compiler and its associated tools [21].

In ModelCC, which supports language composition 
without being scannerless, metadata annotations are 
used for pattern specification, a necessary feature for 
defining the lexical elements of the concrete syntax 
model, i.e. its tokens (Subsection A). 

Annotations are also employed for defining delimiters 
in the concrete syntax model, whose use is common 
for eliminating language ambiguities or just as syntac- 
tic sugar in many languages (Subsection B). 

A third group of ModelCC metadata annotations lets 
us impose cardinality constraints on language elements, 
which control element repeatability and optionality (Sub- 
section C). 

Finally, a fourth set of metadata annotations lets us 
impose evaluation order constraints in ModelCC, which 
are employed to declaratively resolve further ambiguities 
in the concrete syntax of a textual language, by estab- 
lishing associativity, precedence, and composition con- 
straints, the latter employed for resolving the ambiguities 
that cause the typical shift-reduce conflicts in LR parsers 
(Subsection D). 

A summary of the complete set of annotations sup- 
ported by ModelCC can be found at the end of this sec- 
tion, after the detailed description of each of the four 
groups of ModelCC metadata annotations. 

A. Pattern Specification Constraints 

Pattern specification constraints allow the specification 
of the lexical elements in a concrete syntax model, i.e. 
the different token types defined for the concrete syntax 
of a textual language. It should be noted that, once a 
language element is annotated with pattern specification 
constraints, it cannot be a composite element since, as a 
lexical element, it cannot be composed of other elements. 

The above constraint, which forces lexical elements to 
be basic language elements in a ModelCC ASM, does 
not reduce the flexibility of ModelCC for language com- 
position. Language composition, typically achieved by 
scannerless parser generators, can also be achieved if the 
scanner supports lexical ambiguities. When lexical
ambiguities are allowed, as in the Lamb scanning algorithm
[51], the same string or even overlapping strings might
return several tokens, which will later be processed by the
parser consuming the output of the scanner supporting
lexical ambiguities.

1. The @Pattern annotation

The @Pattern annotation allows the specification of
the pattern that will be used to match a basic language
element in the input string. Two mutually exclusive
mechanisms are provided for pattern specification in
ModelCC: regular expressions and user-defined pattern
matching classes. Regular expressions can be specified in
ModelCC to build standard lexers, whereas custom pattern
matching classes allow the language designer to use
any custom-defined matching element to recognize basic
language elements in the input string. The custom pattern
matching class can be anything, since it works as
a black box for ModelCC. It might even be a complete
ModelCC-generated parser, which could be used for the
specification of modular languages, a coarse form of
language composition (e.g. think of the JavaScript scripts
and CSS stylesheets within the HTML in a web page).

@Pattern(regExp="[a-zA-Z][_a-zA-Z0-9]*")
Identifier

Figure 6 Pattern specification example: Regular expression.

[a-zA-Z][_a-zA-Z0-9]*    return Identifier;

Figure 7 Implementation of Figure [6] in lex.

@Pattern(matcher=JavaDocRecognizer, args="simple")
JavaDoc

Figure 8 Pattern specification example: Custom pattern
matching.

When used with regular expressions, the @Pattern
annotation includes an argument representing the regular
expression. This regular expression, specified as the regExp
field of the annotation, corresponds to the traditional
token type definition in lex-like scanners.

When used with custom pattern matching classes, the
@Pattern annotation is used to specify the name of the
class implementing the matching algorithm and its
argument string. In this case, ModelCC resorts to the Lamb
lexer [51], which will employ the pattern matching class
specified by the matcher field of the @Pattern annotation
to process the input string. This Lamb-specific [51]
lexical definition makes use of Lamb support for lexical
ambiguities, overlapping tokens, and black-box token
recognizers.

For example, the @Pattern annotation in Figure [6]
defines the typical Identifier token in programming
languages, which can be specified by the following regular
expression: [a-zA-Z][_a-zA-Z0-9]*. This specification
corresponds to the lex token definition shown in Figure [7].

The example in Figure [8] illustrates the use of custom
pattern matching classes in ModelCC. In this case,
the @Pattern annotation refers to the JavaDocRecognizer
class, which will be responsible for recognizing JavaDoc
comments as basic language elements in the shown ASM.
Arguments can also be specified when using the matcher
field of the @Pattern annotation with the help of its
optional args field ("simple" in Figure [8]).

IntegerLiteral
@Value value : long

Figure 9 Value field specification example: Integer literals.

[0-9]+ {
    yylval.value = atoi(yytext);
    return INTEGERLITERAL;
}

Figure 10 Implementation of Figure [9] using lex & yacc.

As mentioned above, the ModelCC use of custom pattern
matching algorithms for defining basic language elements
has no counterpart in traditional lexer generators,
hence an equivalent lex-like definition cannot be
provided. In ModelCC, the Lamb scanning algorithm
[51] will be responsible for processing such token
definitions.
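
The two pattern specification mechanisms can be contrasted with a small Python sketch. It does not reproduce the Lamb or ModelCC interfaces: here a matcher is assumed to be any callable that receives the input text and a position and returns the length of the recognized lexeme (0 when nothing matches). The names regex_matcher and javadoc_matcher are hypothetical.

```python
import re


def regex_matcher(pattern):
    """Build a matcher from a regular expression (the regExp case)."""
    compiled = re.compile(pattern)

    def match(text, pos):
        # Pattern.match anchors the match at position pos.
        m = compiled.match(text, pos)
        return m.end() - pos if m else 0

    return match


def javadoc_matcher(text, pos):
    """Black-box matcher (the custom matcher case).

    Recognizes comments of the form /** ... */ without using
    a regular expression; any other logic could be plugged in.
    """
    if text.startswith("/**", pos):
        end = text.find("*/", pos + 3)
        if end != -1:
            return end + 2 - pos
    return 0


identifier = regex_matcher(r"[a-zA-Z][_a-zA-Z0-9]*")

print(identifier("x_1 = 3", 0))
print(javadoc_matcher("/** doc */ class A {}", 0))
```

Both matchers expose the same interface, so a scanner can treat the custom one as a black box, which is the property the text attributes to custom pattern matching classes.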



2. The @Value annotation

The @Value annotation can be used in combination
with the @Pattern annotation in ModelCC to indicate
the location where the recognized token value will be
stored in the abstract syntax graph, so that the value can
be used once the input string has been parsed.

The annotation is associated with a field of the class
defining a basic language element; that field will contain
the value obtained from the input string that matches the
token type pattern specification.

When a numeric or boolean field is annotated with
the @Value annotation, it is not necessary to specify the
corresponding @Pattern annotation for recognizing the
numeric or boolean tokens. When the @Value-annotated
field is neither numeric nor boolean (e.g. a string, a single
character, or any non-primitive data type), the use of the
@Pattern annotation is mandatory.

Elements from the ASM that contain a @Value-annotated
numeric or boolean field are transformed into
their corresponding token type lex-like definitions, even
when no @Pattern annotation is present. ModelCC will
always perform the proper assignment of the recognized
token to the @Value-annotated field.

For example, the model in Figure [9] defines an
IntegerLiteral language element that recognizes long integer
literals. The proper regular expression for such literals
will be employed and, whenever an integer literal is found
in the input string, its integer value will be stored in the
@Value-annotated value field of the IntegerLiteral class.
If we had used lex & yacc, we would have had to type the
lexical definition and semantic action shown in Figure [10].

As another example, Figure [11] shows how the @Value
annotation would be used in conjunction with the
@Pattern annotation to define string literals surrounded by
double quotes. A StringLiteral will be recognized whenever
a pair of double quotes encloses a string not containing
a double quote, a constraint that can be specified by
the following regular expression: \"[^\"]*\". The lex &
yacc implementation of this string literal token type
definition would be slightly more complex than the previous
example, as shown in Figure [12].

@Pattern(regExp="\"[^\"]*\"")
StringLiteral
@Value text : String

Figure 11 Value field specification example: Double-quoted
string literals.

\"[^\"]*\" {
    int size, csize;
    /* length of the lexeme without its surrounding quotes */
    size = strlen(yytext) - 2;
    csize = sizeof(char) * (size + 1);
    yylval.pval = malloc(csize);
    strncpy(yylval.pval, yytext + 1, size);
    yylval.pval[size] = '\0';
    return STRINGLITERAL;
}

Figure 12 Implementation of Figure [11] using lex & yacc.
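
The role of the value field can be sketched in Python as follows. The sketch mirrors the lex actions of Figures 10 and 12 rather than ModelCC internals: whether ModelCC itself strips the surrounding quotes is not stated in the text, so that behavior is an assumption taken from the lex version. The functions read_integer and read_string are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class IntegerLiteral:
    value: int   # plays the role of the @Value-annotated field


@dataclass
class StringLiteral:
    text: str    # plays the role of the @Value-annotated field


def read_integer(lexeme):
    """Store the numeric value of a matched integer lexeme."""
    if not re.fullmatch(r"[0-9]+", lexeme):
        raise ValueError(f"not an integer literal: {lexeme!r}")
    return IntegerLiteral(int(lexeme))


def read_string(lexeme):
    """Store the contents of a double-quoted lexeme, quotes removed."""
    if not re.fullmatch(r'"[^"]*"', lexeme):
        raise ValueError(f"not a string literal: {lexeme!r}")
    return StringLiteral(lexeme[1:-1])


print(read_integer("42"))
print(read_string('"hello"'))
```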



B. Delimiter Constraints 

Delimiter constraints allow the specification of lan- 
guage element delimiters in a concrete syntax model. De- 
limiters include prefixes, suffixes, and separators. Such 
kinds of delimiters are often used to eliminate language 
ambiguities and facilitate parsing, but they can appear 
just as syntactic sugar to make languages more readable. 

Usually, reserved words in programming languages act 
just as delimiters. As such, they will not appear in the 
language abstract syntax model. They will be specified 
as metadata annotations in the ASM-CSM mapping cor- 
responding to the concrete syntax of the language. 

It should be noted that delimiters should always be 
specified as constraints on the ASM-CSM mapping rather 
than language elements in the ASM, since they do not 
provide any relevant information from the perspective of 
the abstract syntax model, even though they might be 
necessary to define unambiguous textual languages. 



1. The @Prefix annotation

The @Prefix annotation allows the specification of
prefixes for language elements and specific constituents in
composite language elements.

The value field of the @Prefix annotation is used to
specify the list of regular expressions that define the
prefixes that precede the corresponding language element
(or a specific constituent element within a composite
element) in the concrete syntax of a textual CSM.

[UML class diagram: Program, annotated with
@Prefix("main"), composed of main : Statement]

Figure 13 @Prefix annotation example.

When converting the ASM into a textual CSM, Mod- 
elCC will include the specified prefixes in every produc- 
tion where the annotated element appears in the textual 
CSM grammar, just before the appearance of the an- 
notated element. When the annotation is associated to 
a constituent element within a composite language ele- 
ment, the sequence of prefixes will be included only in 
the productions that correspond to the composite lan- 
guage element, preceding the annotated constituent ele- 
ment within their right-hand side. 

It should be noted that, when the annotated element
is repeatable, the sequence of prefixes appears only once,
preceding the first instance of the annotated element.
Prefixes will also be included in the CSM even when no
elements appear in a repetition language construct (e.g.
as when the opening parenthesis appears before an empty
list of arguments in a parameterless C-like function call).

For example, the model in Figure [13] specifies that
the textual representation of a Program will always be
preceded by a "main" keyword prefix. The grammar
defining the textual CSM for this simple example is
<ProgramMain> ::= "main" <Statement>.



2. The @Suffix annotation

The @Suffix annotation allows the specification of
suffixes for language elements and specific constituents in
composite language elements.

The value field of the @Suffix annotation is used to
specify the list of regular expressions that define the
suffixes that follow the corresponding language element (or a
specific constituent element within a composite element)
in the concrete syntax of a textual CSM.

When converting the ASM into a textual CSM, Mod- 
elCC will include the specified suffixes in every produc- 
tion where the annotated element appears in the textual 
CSM grammar, just after the appearance of the anno- 
tated element. When the annotation is associated to 
a constituent element within a composite language el- 
ement, the sequence of suffixes will be included only in 
the productions that correspond to the composite lan- 
guage element, following the annotated constituent ele- 
ment within their right-hand side. 

It should be noted that, when the annotated element
is repeatable, the sequence of suffixes appears only once,
just after the last element in the sequence of repetitions.
Suffixes will also be included in the CSM even when no
elements appear in a repetition language construct (e.g.
as when the closing parenthesis appears at the end of an
empty list of arguments in a parameterless C-like function
call).

[UML class diagram: InputStatement, annotated with
@Prefix("input") and @Suffix(";"), composed of
ids : Identifier, annotated with @Prefix("(") and
@Suffix(")")]

Figure 14 @Suffix annotation example.

Figure 15 Default @Separator example.

For example, the model in Figure [14] specifies that the
textual representation of an InputStatement is preceded
by an "input" keyword prefix and followed by a semicolon
(";") as its suffix. It also contains a sequence of
Identifiers delimited by opening and closing parentheses:
"(" as the prefix of the ids constituent of the
InputStatement composite and ")" as its suffix. The grammar
defined by the ASM-CSM mapping specified by the
annotations in Figure [14] is <InputStatement> ::=
"input" "(" <IdentifierList> ")" ";" and <IdentifierList> ::=
<Identifier> <IdentifierList> | <Identifier>.

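The placement of delimiters in the generated production can be sketched in Python. The helper below is hypothetical and only illustrates the ordering rule: element-level prefixes and suffixes wrap the whole right-hand side, whereas member-level delimiters wrap only the annotated constituent.

```python
def quoted(delimiters):
    """Render delimiter literals as quoted grammar terminals."""
    return [f'"{d}"' for d in delimiters]


def composite_production(name, members, prefix=(), suffix=()):
    """Return the production for a composite element with delimiters.

    members -- list of (symbol, member_prefixes, member_suffixes)
    prefix  -- prefixes attached to the composite element itself
    suffix  -- suffixes attached to the composite element itself
    """
    rhs = quoted(prefix)
    for symbol, member_prefixes, member_suffixes in members:
        rhs += quoted(member_prefixes) + [f"<{symbol}>"] + quoted(member_suffixes)
    rhs += quoted(suffix)
    return f"<{name}> ::= " + " ".join(rhs)


print(composite_production(
    "InputStatement",
    [("IdentifierList", ["("], [")"])],
    prefix=["input"],
    suffix=[";"],
))
```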


3. The @Separator annotation

The @Separator annotation allows the specification
of separators between consecutive instances of elements
within a repetition. Separators can be defined in ModelCC
by annotating a language element in the ASM or
just its appearance within a particular repetition
construct. In the first case, a default separator is
established for the language element: the specified separator
will be used for separating consecutive instances of the
annotated language element whenever a sequence of such
language elements appears in a textual CSM. In the
second case, an ad hoc separator is defined: the specified
separator will be used only when consecutive instances
of the language element appear within the context of the
annotated repetition construct.

The ad hoc definition of separators with the @Separator
annotation within repetition constructs can be used
to override the default sequence of separators associated
with the repeatable element in a repetition construct (or
just to disable the use of separators for that specific
repetition).

Therefore, default separators are specified for language
elements by the @Separator annotation. When converting
the ASM into a textual CSM, ModelCC will include
the sequence of regular expressions defining those default
separators in every recursive production rule generated
from a repetition where the annotated element is
repeatable. When the @Separator annotation accompanies a
repeatable element within a particular repetition
construct, i.e. the ad hoc case, separators will only appear
in the recursive production rule derived from that
particular repetition construct, but not in other constructs
where the constituent element might also be repeatable.

As an example of defining a default separator,
Figure [15] illustrates how a comma (",") can be used
as the default separator for Identifiers. Whenever a
list of Identifiers is needed within the language, a
comma will separate consecutive identifiers. In the
example, since a VariableDeclaration contains a Type
and a set of Identifiers, the identifiers will be separated
by "," in the textual CSM derived from the language
ASM. The grammar of the resulting CSM will include
the following productions: <VariableDeclaration> ::=
<Type> <IdentifierList> ";" and <IdentifierList> ::=
<Identifier> "," <IdentifierList> | <Identifier>.

As an example illustrating the use of ad hoc
separators, consider the model in Figure [16]. Here, Identifiers
are also separated by commas, but only within
InputStatements, i.e. "," is the ad hoc separator for identifiers
within input statements, but lists of identifiers might
employ different separators elsewhere. The grammar
associated with the textual CSM derived from Figure [16] would
include the following productions: <InputStatement>
::= "input" "(" <InputStatementIdentifierList> ")" ";"
and <InputStatementIdentifierList> ::= <Identifier>
"," <InputStatementIdentifierList> | <Identifier>.

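To make the effect of the separator productions concrete, the following Python sketch is a hand-written recursive-descent recognizer for the list rule with a separator. It is an illustration only: ModelCC generates its parsers with its own algorithms, and parse_identifier_list is a hypothetical function operating on an already tokenized input.

```python
import re

IDENTIFIER = re.compile(r"[a-zA-Z][_a-zA-Z0-9]*")


def parse_identifier_list(tokens, pos=0, separator=","):
    """Recognize <List> ::= <Identifier> sep <List> | <Identifier>.

    Returns the identifiers found and the position of the first
    unconsumed token; raises SyntaxError on malformed input.
    """
    if pos >= len(tokens) or not IDENTIFIER.fullmatch(tokens[pos]):
        raise SyntaxError(f"identifier expected at token {pos}")
    identifiers = [tokens[pos]]
    pos += 1
    # Recursive alternative: a separator must be followed by another list.
    if pos < len(tokens) and tokens[pos] == separator:
        rest, pos = parse_identifier_list(tokens, pos + 1, separator)
        identifiers += rest
    return identifiers, pos


print(parse_identifier_list(["a", ",", "b", ",", "c"]))
print(parse_identifier_list(["x", ";", "y"], separator=";"))
```

Passing a different separator argument plays the role of an ad hoc separator: the same element type is recognized with another delimiter in a specific context.
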
C. Cardinality Constraints 

A third group of ModelCC metadata annotations lets 
us impose cardinality constraints on language elements, 
which control element repeatability and optionality. 

1. The @Optional annotation

The @Optional annotation allows the specification of
optional elements in textual CSMs.

Optional elements naturally appear in language
specifications and optionality could always be modeled by
means of selection constructs. However, the declarative
specification of the optionality constraints is necessary to
avoid unnecessary duplication in the language model.

Figure 16 Ad hoc @Separator example.

When one of the constituent elements within a composite
language element is optional, the textual representation
of the composite element might include the optional
element, along with its corresponding delimiters, or not.
In the latter case, the missing element delimiters are not
included in the textual representation of the composite
element either, even though a prefix and a suffix might have
been defined for the missing constituent element.

If we performed a naive transformation of a composite
language element into a set of CFG production rules and
the composite element included i optional elements, 2^i
production rules would result with the composite element in
their left-hand side and all the possible combinations of
optional elements in their right-hand side. A more
reasonable transformation employs just i ancillary epsilon
production rules.

For example, the model in Figure [17] shows that a
ConditionalStatement contains an Expression, the
Statement that will be run when the Expression
evaluates to true, and, optionally, the Statement that will
be run when the Expression evaluates to false. The
grammar resulting from the model transformation into
a textual CSM will include the following two
productions: <ConditionalStatement> ::= "if" <Expression>
<Statement> <OptionalElse> and <OptionalElse> ::=
"else" <Statement> | ε.

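The difference between the naive transformation and the one based on ancillary epsilon productions can be checked with a short Python sketch. Both functions are hypothetical helpers written for this comparison; the "Optional" naming of the ancillary nonterminals follows the convention used in the example above.

```python
from itertools import product


def naive_productions(name, members):
    """Enumerate every present/absent combination of optional members.

    members -- list of (symbol, is_optional)
    A composite with i optional members yields 2**i productions.
    """
    choices = [((sym,), ()) if opt else ((sym,),) for sym, opt in members]
    productions = []
    for combination in product(*choices):
        rhs = " ".join(f"<{s}>" for part in combination for s in part) or "ε"
        productions.append(f"<{name}> ::= {rhs}")
    return productions


def ancillary_productions(name, members):
    """Use one ancillary nonterminal with an epsilon alternative
    per optional member instead of enumerating combinations."""
    rhs, extra = [], []
    for sym, opt in members:
        if opt:
            rhs.append(f"<Optional{sym}>")
            extra.append(f"<Optional{sym}> ::= <{sym}> | ε")
        else:
            rhs.append(f"<{sym}>")
    return [f"<{name}> ::= " + " ".join(rhs)] + extra


members = [("A", False), ("B", True), ("C", True), ("D", True)]
print(len(naive_productions("S", members)))      # 2**3 combinations
print(len(ancillary_productions("S", members)))  # 1 + 3 productions
```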


2. The @Minimum annotation

The @Minimum annotation, depicted as a minimum
multiplicity constraint in standard UML notation, allows
the specification of the lower bound for the multiplicity
of repeatable language elements within repetition
constructs. This lower bound is 1 by default.

It should be noted that, when the minimum multiplicity
is 0, no elements might appear in a particular
instance of the repetition. However, delimiters would still
be represented in the textual CSM unless the @Optional
annotation were explicitly employed.

Figure 17 @Optional element example: if-then-else
statement.

Figure 18 Minimum multiplicity example.

ModelCC generates semantic actions that check that 
multiplicity constraints are satisfied whenever they are 
specified in the model. The Fence parser [50] allows 
semantic actions to implement such multiplicity checks 
and, when they are not satisfied, the corresponding re- 
duction is automatically inhibited. 

Other parser generators would require the explicit
generation of a grammar representing the minimum
cardinality constraint, i.e. when an element must appear
at least i times within a repetition, two production
rules would be necessary: <MinIElements> ::=
<Element> ... i times ... <Element> <ElementList>
and <ElementList> ::= <Element> <ElementList> | ε.

For example, the model in Figure [18] specifies
an ExpressionSet, which might include 0 or more
Expressions. In UML notation, no explicit @Minimum
annotation is needed, since the minimum multiplicity
constraint in the exps association has the same
purpose. The grammar corresponding to the textual CSM
derived from Figure [18] would include the following
productions to represent the possibility of empty sets:
<ExpressionSet> ::= "{" <OptionalExpressionList>
"}"; <OptionalExpressionList> ::= <ExpressionList> |
ε; <ExpressionList> ::= <Expression> ","
<ExpressionList> | <Expression>.

3. The @Maximum annotation

The @Maximum annotation, depicted as a maximum
multiplicity constraint in standard UML associations,
allows the specification of the upper bound for the
multiplicity of repeatable language elements within repetition
constructs. This upper bound is undefined by default.

ModelCC incorporates semantic actions that check
that multiplicity constraints are satisfied whenever they
are specified in the model. The Fence parser [50] allows
semantic actions to implement such multiplicity checks
so that, when they are not satisfied, the corresponding
reduction is automatically inhibited. In other words, if
a specific upper bound is surpassed for a list of
elements, the generated parser would not recognize the list
of elements as such, since it does not satisfy the
cardinality constraint imposed by the maximum multiplicity
annotation.

Other parser generators would require the generation 
of more complex grammars to support maximum mul- 
tiplicity constraints. In general, if i is the maximum 
multiplicity, i alternative production rules would be nec- 
essary (assuming that the default minimum multiplicity 
holds, i.e. 1). When both the minimum i and the
maximum j multiplicities are specified, one production would
be used for representing the minimum multiplicity
constraint and j - i additional productions would be needed
to enforce the maximum multiplicity constraint. More
complex combinations of multiplicity constraints could 
also be devised. 

For example, the model in Figure 19 indicates that a
Program might have from 0 to 2 Parameters. Here, both
the minimum and the maximum multiplicity annotations are
inferred from the multiplicity of the UML association.
The grammar corresponding to the textual CSM derived
from Figure 19 would include the following production
rules: <Program> ::= <OptionalParameterList>;
<OptionalParameterList> ::= <ParameterList> | ε;
<ParameterList> ::= <Parameter> <ParameterList>
| <Parameter>, where the last production would be
accompanied by a semantic action that would check
whether the maximum multiplicity constraint holds. If
such a feature were not available in our parser generator,
this last production would have to be replaced
by a more explicit (and potentially verbose) set of
equivalent productions incorporating the maximum multiplicity
constraint: <ParameterList> ::= <Parameter>
| <Parameter> <Parameter>. This approach poses no
problems in this simple example, but it might get much
more complicated with larger bounds (which is still not a
problem when the resulting grammar is automatically
generated by a model-driven language specification tool).
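
The explicit expansion described above can be sketched in a few lines of Java. This is only an illustration of the idea, not ModelCC code: the class name BoundedRepetition and its methods are hypothetical, and the sketch assumes a minimum multiplicity of at least 1 (the optional case is handled by a separate ε production, as in the example).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of ModelCC): expands a bounded repetition
// into the explicit alternatives a grammar would need without semantic actions.
final class BoundedRepetition {

    // One alternative per allowed length: min, min + 1, ..., max copies of the element.
    static List<String> alternatives(String element, int min, int max) {
        if (min < 1 || max < min) {
            throw new IllegalArgumentException("expected 1 <= min <= max");
        }
        List<String> result = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            result.add(String.join(" ", Collections.nCopies(n, element)));
        }
        return result;
    }

    // Joins the alternatives into a single BNF-like production.
    static String production(String lhs, String element, int min, int max) {
        return lhs + " ::= " + String.join(" | ", alternatives(element, min, max));
    }

    public static void main(String[] args) {
        // prints: <ParameterList> ::= <Parameter> | <Parameter> <Parameter>
        System.out.println(production("<ParameterList>", "<Parameter>", 1, 2));
    }
}
```

With a minimum of i and a maximum of j, the sketch produces j - i + 1 alternatives: one for the minimum and j - i more for the remaining lengths, which matches the count given in the text.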



D. Evaluation Order Constraints 

Figure 19 Maximum multiplicity example.

A fourth set of ModelCC metadata annotations lets
us impose evaluation order constraints, which are employed
to declaratively resolve syntactic ambiguities in
the concrete syntax of a textual language by establishing
associativity, composition, and precedence constraints for
CSMs.

1. The @Associativity annotation

The @Associativity annotation allows the specification
of the operator-like associativity of language elements.
ModelCC supports the following options for specifying
associativity constraints:

• UNDEFINED, when no associativity is declared
(by default); all possibilities are considered.

• LEFT_TO_RIGHT, for left-associative operations
(e.g. subtraction, division, or function application).

• RIGHT_TO_LEFT, for right-associative elements
(e.g. exponentiation and function definition).

• NON_ASSOCIATIVE, for non-associative elements
(e.g. the cross product of three vectors).

The specification of associativity constraints helps us
resolve ambiguities that might appear in recursive compositions
(i.e. when using the composite design pattern
[20] for modeling the ASM for operations without explicit
delimiters), where different interpretations of the input
string could be given unless the associativity constraints
impose an order on the reductions that can be performed
(either left-to-right or right-to-left).

For each production in the CSM grammar where
the nonterminal of a language element with associativity
constraints is preceded and/or followed by the nonterminal
that appears on the left-hand side of the production or
any of its superclasses in the ASM, ModelCC generates
a semantic action enforcing that associativity constraint.
That semantic action, which inhibits the reduction of the
production when the constraint is not met, is directly
supported by the Fence parsing algorithm [50]. Fence
inhibits the corresponding reduction in three situations:

• Whenever the element that follows a left-to-right 
associative element was generated by a reduction 
of the same production. 

• Whenever the element that precedes a right-to-left
associative element was generated by a reduction
of the same production.

• Whenever an element that precedes or follows a
non-associative element was generated by a reduction
of the same production.

Figure 20 Associativity constraint example.

The parsing algorithms implemented by other parser 
generators also offer mechanisms to enforce associativity 
constraints. 

For example, the model in Figure 20 establishes
that BinaryOperators are left-associative.
The grammar for the resulting textual CSM
would include the productions <Expression> ::=
<BinaryExpression> and <BinaryExpression> ::=
<Expression> <BinaryOperator> <Expression>, where
associativity is not explicit. Left-associativity will be
imposed by the corresponding parser semantic action.
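
The three inhibition rules listed above can be written down as a small predicate. The following Java sketch is only illustrative: the class AssociativityCheck and its method are hypothetical names, and the two boolean flags stand for information that a parser such as Fence would derive from its own parse structures.

```java
// Hypothetical sketch of the reduction-inhibiting rule; not the Fence implementation.
final class AssociativityCheck {

    enum Associativity { UNDEFINED, LEFT_TO_RIGHT, RIGHT_TO_LEFT, NON_ASSOCIATIVE }

    // Returns true when the reduction must be inhibited.
    // precedingFromSameProduction: the element before the operator was itself
    // produced by a reduction of the same production.
    // followingFromSameProduction: likewise for the element after the operator.
    static boolean inhibits(Associativity associativity,
                            boolean precedingFromSameProduction,
                            boolean followingFromSameProduction) {
        switch (associativity) {
            case LEFT_TO_RIGHT:
                return followingFromSameProduction;
            case RIGHT_TO_LEFT:
                return precedingFromSameProduction;
            case NON_ASSOCIATIVE:
                return precedingFromSameProduction || followingFromSameProduction;
            default:
                return false; // UNDEFINED: every interpretation is kept
        }
    }

    public static void main(String[] args) {
        // "8 - 3 - 2" read as (8 - 3) - 2: the left operand is a BinaryExpression.
        System.out.println(inhibits(Associativity.LEFT_TO_RIGHT, true, false));  // false
        // "8 - 3 - 2" read as 8 - (3 - 2): the right operand is a BinaryExpression.
        System.out.println(inhibits(Associativity.LEFT_TO_RIGHT, false, true));  // true
    }
}
```

Under left-to-right associativity only the (8 - 3) - 2 reading survives, which is the behavior the semantic action is expected to enforce for the productions above.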

2. The @Composition annotation

The @Composition annotation allows the specification
of the suitable order of evaluation of compositions represented
in a CSM, a situation that appears whenever
the composite design pattern [20] is present in the ASM,
no delimiters are employed to eliminate potential ambiguities,
and the composite contains several consecutive
components of the same type as the composite. When
such composites are nested, different interpretations are
possible unless we specify composition constraints in the
CSM. This is the case of the typical shift-reduce conflicts
that appear in LR parsers when parsing nested if-then-else
statements.

Hence, a specific constraint on element composition 
must be used to enforce a particular interpretation of 
such nested compositions in the ASM-CSM mapping. 
ModelCC supports the following options for composition 
constraints: 

• UNDEFINED, when no composition constraints
are defined and potential ambiguities are taken into
account.

• EAGER, when the matching of constituent elements
is performed as soon as possible. This corresponds
to the typical interpretation of nested
if-then-else statements in programming languages,
where the else clause is attached to the innermost
if statement.

• LAZY, when the matching of constituent elements
is deferred as much as possible. Then, a rightmost
derivation is obtained; i.e. when an element might
accompany any of two nested language constructs,
it is associated with the outermost one.

• EXPLICIT, when no composition constraints are
defined and any ambiguities should be resolved
with the help of delimiters.

Figure 21 Composition constraint example.

Composition order constraints are enforced by defining
precedences for the productions in the grammar of
the resulting textual CSM. Establishing such precedences
is possible in most parsing tools, including all yacc [28]
derivatives and Fence [50]. When composition is eager,
shift operations will precede reduce operations. In contrast,
when composition is lazy, reduce operations will
have precedence over shift operations. Finally, when the
composition order must be explicit in the CSM, the use
of delimiters will determine whether shift or reduce operations
are performed on a case-by-case basis.
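
The shift/reduce preferences described in this paragraph can be summarized in a short Java sketch. It is a simplified illustration under stated assumptions: CompositionPolicy, Preference, and elseOwner are hypothetical names, and the second method only models the dangling-else situation, where a run of nested if statements is still waiting for a possible else clause.

```java
// Hypothetical sketch of how a composition constraint could steer a
// shift/reduce conflict; it is not taken from ModelCC or Fence.
final class CompositionPolicy {

    enum Composition { UNDEFINED, EAGER, LAZY, EXPLICIT }

    enum Preference { SHIFT, REDUCE, NONE }

    // EAGER prefers shifting, LAZY prefers reducing; the other options
    // express no preference (ambiguities are kept or delimiters decide).
    static Preference onConflict(Composition composition) {
        switch (composition) {
            case EAGER: return Preference.SHIFT;
            case LAZY:  return Preference.REDUCE;
            default:    return Preference.NONE;
        }
    }

    // Index (0 = outermost) of the pending if statement that receives an
    // else clause, or -1 when the constraint does not decide.
    static int elseOwner(Composition composition, int pendingIfs) {
        if (pendingIfs < 1) {
            throw new IllegalArgumentException("at least one pending if is required");
        }
        switch (onConflict(composition)) {
            case SHIFT:  return pendingIfs - 1; // innermost if
            case REDUCE: return 0;              // outermost if
            default:     return -1;
        }
    }

    public static void main(String[] args) {
        // Two nested if statements are pending when the else is read.
        System.out.println(elseOwner(Composition.EAGER, 2)); // 1 (innermost)
        System.out.println(elseOwner(Composition.LAZY, 2));  // 0 (outermost)
    }
}
```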

For example, the model in Figure 21 represents typical
if-then-else statements. In this case, the optional
else Statement of the eager ConditionalStatement will always
match the innermost if statement when such statements
are nested, e.g. in "if E1 if E2 S1 else S2",
the else clause will correspond to the E2 if statement.
The grammar for the resulting CSM will include the following
productions: <ConditionalStatement> ::= "if"
<Expression> <Statement> ";" | "if" <Expression>
<Statement> "else" <Statement>. The parser will
enforce the precedence of the second alternative over the
first one, so that else clauses are parsed as usual.

Figure 22 Relative priority resolved by the lexical analyzer.



3. The @Priority annotation

The @Priority annotation allows the specification of
precedences among language elements for eliminating
ambiguities in textual CSMs.

ModelCC implements two mechanisms to specify pri- 
ority constraints in the ASM-CSM mapping: 

• A relative one, where precedence relationships are 
established between particular language elements 
(a precedes declaration indicates which language 
elements have lower priority than the current el- 
ement). 

• An absolute one, where a numeric priority value
determines the priority level for each language element
(the lower the value, the higher the priority).

Unless specified otherwise, all language elements have
the same priority. Precedences established among basic
language elements must be managed at the lexical analysis
level. In ModelCC, the Lamb scanning algorithm
[51] enforces those lexical precedences. The Fence parsing
algorithm [50] manages all the remaining precedences
that can be established among the non-basic language elements
that appear in concatenation, selection, and repetition
constructs within the CSM.

For example, the model in Figure 22 establishes a (fictitious)
relative priority constraint between function names
and identifiers: a FunctionName will always precede an
Identifier. In case a string like "func_power" is found in
the input string, it will be recognized as a FunctionName,
but not as an Identifier. Since the constraint is defined
over basic language elements, the lexical analyzer generated
by ModelCC will be responsible for identifying the
right element in the input string.
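
A relative priority between basic language elements can be pictured as a filter over the token types that match a lexeme. The Java sketch below is a simplified illustration, not the Lamb algorithm: the class LexicalPrioritySketch is hypothetical, and the two regular expressions are assumptions chosen only so that both token types match "func_power" (the patterns in Figure 22 are not reproduced in the text).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of lexical priority resolution; not the Lamb lexer.
final class LexicalPrioritySketch {

    static final class TokenType {
        final String name;
        final Pattern pattern;
        final Set<String> precedes; // names of token types with lower priority

        TokenType(String name, String regex, String... precedes) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
            this.precedes = new HashSet<>(Arrays.asList(precedes));
        }
    }

    // Assumed patterns: FunctionName precedes Identifier.
    static List<TokenType> exampleTypes() {
        return Arrays.asList(
            new TokenType("FunctionName", "func_[a-z]+", "Identifier"),
            new TokenType("Identifier", "[a-z_]+"));
    }

    // Keeps every matching token type that no other matching type precedes.
    static List<String> classify(String lexeme, List<TokenType> types) {
        List<TokenType> matching = new ArrayList<>();
        for (TokenType type : types) {
            if (type.pattern.matcher(lexeme).matches()) {
                matching.add(type);
            }
        }
        List<String> result = new ArrayList<>();
        for (TokenType candidate : matching) {
            boolean dominated = false;
            for (TokenType other : matching) {
                if (other.precedes.contains(candidate.name)) {
                    dominated = true;
                }
            }
            if (!dominated) {
                result.add(candidate.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify("func_power", exampleTypes())); // [FunctionName]
        System.out.println(classify("power", exampleTypes()));      // [Identifier]
    }
}
```

When no precedes relationship applies, the sketch keeps every matching token type, which mirrors the idea that lexical ambiguities are preserved unless a priority constraint removes them.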

Figure 23 shows another example. In this case,
the model enforces relative priority constraints between
composite language elements that will be resolved by
the parser generated by ModelCC. Here, output statements
precede function calls so that a string like
"output(3+5,4+1);" will be recognized as an OutputStatement
but not as a FunctionCallStatement, even though
"output" would be a perfectly valid identifier. The
ModelCC lexer, which supports lexical ambiguities, will
consider "output" as an Identifier and also as a delimiter
for output statements (i.e. its prefix keyword in the
CSM).

Figure 23 Relative priority resolved by the syntactic analyzer.



E. Summary of ModelCC ASM-CSM Constraints 

Table 1 summarizes the set of constraints supported by
ModelCC for establishing ASM-CSM mappings between
abstract syntax models and their concrete representation
in textual CSMs:

• Pattern specification constraints are employed to 
specify the textual representation of basic language 
elements in the CSM. They define the token types 
recognized by the lexer and indicate where the rec- 
ognized tokens will be stored in the ASM. 

• Delimiter constraints determine the prefixes, suf- 
fixes, and separators that will be used to mark the 
boundaries of language elements in the CSM. They 
can be used for eliminating language ambiguities or 
just as syntactic sugar in text-based CSMs. 

• Cardinality constraints restrict the multiplicity of 
repetitions and determine the optionality of lan- 
guage elements. The CSM must consider such con- 
straints for defining the grammar that defines the 
language recognized by the generated parser. 

• Finally, evaluation order constraints allow the explicit
resolution of different kinds of lexical and syntactic
ambiguities that might appear in the ASM-CSM
mapping.

Table 1 Summary of the metadata annotations supported by ModelCC.



VI. A WORKING EXAMPLE 

In this section, we compare how an interpreter for
arithmetic expressions can be implemented using conventional
tools and how it can be implemented following the
model-driven approach with ModelCC. Although the calculator
example in this section is necessarily simplistic,
it already provides some hints on the potential benefits
model-driven language specification can bring to more
challenging endeavors.

First, we will outline the features we wish to include 
in our calculator language. Later, we will describe how 
an interpreter for this language is built using two of the 
most established tools in use by language designers: lex 
& yacc on the one hand, ANTLR on the other. Finally, 
we will implement the same language processor using 
ModelCC by defining an abstract syntax model. This
ASM will be annotated to specify the required ASM-CSM 
mapping and it will also include the necessary logic for 
evaluating arithmetic expressions. This example will let 
us compare ModelCC against conventional parser gen- 
erators and it will be used for discussing the potential 
advantages provided by our model-driven language spec- 
ification approach. 

A. Language Description 

Our calculator will employ classical arithmetic expres- 
sions in infix notation. The language will feature the 
following capabilities: 

• Unary operators: + and -.

• Binary operators: +, -, *, and /, with - and /
being left-associative.

• Operator priorities: * and / precede + and -. 

• Parenthesized expressions. 

• Integer and floating-point literals. 



B. Conventional Implementation 

Using conventional tools, the language designer would
start by specifying the grammar defining the calculator
language in a BNF-like notation. The BNF grammar
shown in Figure 24 meets the requirements of our simple
calculator, although it is not yet suitable for being used
with existing parser generators, since they impose specific
constraints on the format of the grammar depending on
the parsing algorithms they employ.



1. Lex & yacc implementation 

When using lex & yacc, the language designer converts
the BNF grammar into a grammar suitable for LR
parsing. A suitable lex/yacc implementation defining the
arithmetic expression grammar is shown in Figure 25.

Since lex does not support lexical ambiguities, the
UnaryOperator and BinaryOperator nonterminals from
the BNF grammar in Figure 24 have to be refactored
in order to avoid the ambiguities introduced by the use
of + and - both as unary and binary operators. A
typical solution consists of creating the UnaryOrPriority2BinaryOperator
token type for representing them and
then adjusting the grammar accordingly. This token will
act as a UnaryOperator in UnaryExpressions, and as a
BinaryOperator in BinaryExpressions.

A similar solution is necessary for distinguishing dif- 
ferent operator priorities, hence different token types are 
defined for each precedence level in the language, even 
though they perform the same role from a conceptual 
point of view. The order in which they are declared in 
the yacc specification determines their relative priorities 
(please, note that these declarations are also employed to 
define operator associativity). 

Unfortunately, the requirement to resolve ambiguities
by refactoring the grammar defining the language involves
the introduction of a certain degree of duplication
in the language specification: separate token types in the
lexer and multiple parallel production rules in the parser.

<Expression> ::= <ParenthesizedExpression>
               | <BinaryExpression>
               | <UnaryExpression>
               | <LiteralExpression>

<ParenthesizedExpression> ::= '(' <Expression> ')'

<BinaryExpression> ::= <Expression> <BinaryOperator> <Expression>

<UnaryExpression> ::= <UnaryOperator> <Expression>

<LiteralExpression> ::= <RealLiteral>
                      | <IntegerLiteral>

<BinaryOperator> ::= '+' | '-' | '/' | '*'

<UnaryOperator> ::= '+' | '-'

<RealLiteral> ::= <IntegerLiteral> '.' | <IntegerLiteral> '.' <IntegerLiteral>

<IntegerLiteral> ::= <Digit> <IntegerLiteral> | <Digit>

<Digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Figure 24 BNF grammar for the arithmetic expression language.

// Lex specification ['calc.lex']

%{
#include "y.tab.h"
extern YYSTYPE yylval;
%}
%%

[0-9]+\.[0-9]*  return RealLiteral;
[0-9]+          return IntegerLiteral;
\+|\-           return UnaryOrPriority2BinaryOperator;
\/|\*           return Priority1BinaryOperator;
\(              return LeftParenthesis;
\)              return RightParenthesis;

// Yacc specification ['calc.yacc']

%left UnaryOrPriority2BinaryOperator
%left Priority1BinaryOperator

%token IntegerLiteral RealLiteral LeftParenthesis RightParenthesis

%start Expression

%%

Expression : RealLiteral
           | IntegerLiteral
           | LeftParenthesis Expression RightParenthesis
           | UnaryOrPriority2BinaryOperator Expression
           | Expression UnaryOrPriority2BinaryOperator Expression
           | Expression Priority1BinaryOperator Expression
           ;

%%

#include "lex.yy.c"

int main(int argc, char *argv[]) { yyparse(); }
int yyerror(char *s) { printf("%s", s); }

Figure 25 lex & yacc specification of the arithmetic expression language.

Once all ambiguities have been resolved, the language
designer completes the lex & yacc specification by introducing
semantic actions to perform the necessary operations. In this case,
albeit somewhat verbose in C syntax, the implementation
of an arithmetic expression evaluator is relatively
straightforward using the yacc $ notation, as shown in
Figure 26. In our particular implementation of the calculator
interpreter, carriage returns are employed to output
results, hence our use of the ancillary LineReturn token type
and Line nonterminal symbol.



2. ANTLR implementation 

When using ANTLR, the language designer converts
the BNF grammar into a grammar suitable for LL parsing.
An ANTLR specification of our arithmetic expression
language is shown in Figure 27.

Since ANTLR provides no mechanism for the declarative
specification of token precedences, such precedences
have to be incorporated into the grammar. The usual solution
involves the creation of different nonterminal symbols
in the grammar, so that productions corresponding
to the same precedence levels are grouped together. The
productions with expression1 and expression2 on their
left-hand side were introduced with this purpose in our
calculator grammar.

Likewise, since ANTLR generates an LL(*) parser,
which does not support left-recursion, left-recursive
grammar productions in the grammar shown in Figure
24 have to be refactored. In our example, a simple solution
involves the introduction of the expression3 nonterminal,
which, in conjunction with the aforementioned expression1
and expression2 nonterminals, eliminates left-recursion
from our grammar.

Once the grammar is adjusted to satisfy the constraints
imposed by the ANTLR parser generator, the language
designer can define the semantic actions needed to implement
our arithmetic expression interpreter. The resulting
ANTLR implementation is shown in Figure 28.
The streamlined syntax of the scannerless ANTLR parser
generator makes this implementation significantly more
concise than the equivalent lex & yacc implementation.
However, the constraints imposed by the underlying parsing
algorithm force explicit changes on the language
grammar (cf. BNF grammar in Figure 24).
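
To make the effect of these grammar changes concrete, the following Java sketch shows a hand-written recursive-descent evaluator for the same calculator language. It is not the ANTLR-generated parser and it does not follow the productions of Figures 27 and 28 rule by rule: it uses the common alternative of replacing left recursion with iteration, one method per precedence level, which keeps - and / left-associative. The class and method names are hypothetical, and, as a simplifying assumption, unary operators bind tighter than binary ones.

```java
// Hypothetical hand-written evaluator; a sketch, not generated by any tool.
final class CalculatorSketch {

    private final String input;
    private int pos = 0;

    private CalculatorSketch(String input) {
        this.input = input.replace(" ", "");
    }

    static double evaluate(String text) {
        CalculatorSketch parser = new CalculatorSketch(text);
        double value = parser.additive();
        if (parser.pos != parser.input.length()) {
            throw new IllegalArgumentException("unexpected input at position " + parser.pos);
        }
        return value;
    }

    // additive : multiplicative (('+' | '-') multiplicative)*
    private double additive() {
        double value = multiplicative();
        while (peek() == '+' || peek() == '-') {
            char op = input.charAt(pos++);
            double rhs = multiplicative();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    // multiplicative : unary (('*' | '/') unary)*
    private double multiplicative() {
        double value = unary();
        while (peek() == '*' || peek() == '/') {
            char op = input.charAt(pos++);
            double rhs = unary();
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    // unary : ('+' | '-') unary | '(' additive ')' | literal
    private double unary() {
        char c = peek();
        if (c == '+') { pos++; return unary(); }
        if (c == '-') { pos++; return -unary(); }
        if (c == '(') {
            pos++;
            double value = additive();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ')' at position " + pos);
            }
            pos++;
            return value;
        }
        return literal();
    }

    // literal : digits ('.' digits?)?
    private double literal() {
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (pos == start) {
            throw new IllegalArgumentException("expected a number at position " + pos);
        }
        if (peek() == '.') {
            pos++;
            while (Character.isDigit(peek())) pos++;
        }
        return Double.parseDouble(input.substring(start, pos));
    }

    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(evaluate("1-2-3")); // -4.0
        System.out.println(evaluate("2+3*4")); // 14.0
    }
}
```

Even in this small sketch, precedence and associativity are encoded in the structure of the parsing code itself, which is the kind of coupling between syntax details and language structure that the section discusses.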



C. ModelCC Implementation 

When following a model-based language specification
approach, the language designer starts by elaborating an
abstract syntax model, which will later be mapped to
a concrete syntax model by imposing constraints on the
abstract syntax model. These constraints can also be
specified as metadata annotations on the abstract syntax
model and the resulting annotated model can be processed
by automated tools, such as ModelCC, to generate
the corresponding lexers and parsers. Annotated models
can be represented graphically, as the UML diagram in
Figure 29, or implemented using conventional programming
languages, as the Java implementation listed in Figure 30.

In the current version of ModelCC, annotated models
representing the ASM and a particular ASM-CSM mapping
are used to generate Lamb lexers [51] and Fence
parsers [50], albeit traditional LL and LR parsers might
also be generated whenever the ASM-CSM mapping constraints
make LL and LR parsing feasible.

Since the abstract syntax model in ModelCC is not 
constrained by the vagaries of particular parsing algo- 
rithms, the language design process can be focused on its 
conceptual design, without having to introduce artifacts 
in the design just to satisfy the demands of particular 
tools: 

• As we saw in the lex & yacc example, conventional 
tools, unless they are scannerless, force the creation 
of artificial token types in order to avoid lexical am- 
biguities, which leads to duplicate grammar pro- 
duction rules and semantic actions in the language 
specification. As in any other software development 
project, duplication hinders the evolution of lan- 
guages and affects the maintainability of language 
processors. ModelCC, even though it is not scannerless,
supports lexical ambiguities and each basic
language element is defined as a separate and
independent entity, even when their pattern specifications
are in conflict. Therefore, duplication in
the language model does not have to be included
to deal with lexical ambiguities: token type definitions
do not have to be adjusted, duplicate syntactic
construct rules will not appear in the language
model, and, as a consequence, semantic actions do
not have to be duplicated either.

• As we also saw both in the lex & yacc calculator
and in the ANTLR solution to the same problem,
established parser generators require modifications
to the language grammar specification in
order to comply with parsing constraints, be it
the elimination of left-recursion for LL parsers or
the introduction of new nonterminals to restructure
the language specification so that the desired
precedence relationships are fulfilled. In the model-driven
language specification approach, the left-recursion
problem disappears since it is something
the underlying tool can easily deal with in a fully
automated way when an abstract syntax model is
converted into a concrete syntax model. Moreover,
the declarative specification of constraints, such as
the evaluation order constraints in Section IV.D1 is
orthogonal to the abstract syntax model that defines
the language. Those constraints determine
the ASM-CSM mapping and, since ModelCC takes
charge of everything in that conversion process, the
language designer does not have to modify the abstract
syntax model just because a given parser
generator might prefer its input in a particular format.
This is the main benefit that results from
raising the abstraction level in model-based language
specification.

// Lex specification ['calc.lex']

%{
#include <stdlib.h>
#include "string.h"
#include "y.tab.h"
extern YYSTYPE yylval;
%}
%%

[0-9]+\.[0-9]*  { yylval.value = atof(yytext); return RealLiteral; }
[0-9]+          { yylval.value = (double) atoi(yytext); return IntegerLiteral; }
\+|\-           {
                  if (yytext[0] == '+') yylval.operator = PLUSORADDITION;
                  else /* yytext[0] == '-' */ yylval.operator = MINUSORSUBSTRACTION;
                  return UnaryOrPriority2BinaryOperator;
                }
\/|\*           {
                  if (yytext[0] == '*') yylval.operator = MULTIPLICATION;
                  else /* yytext[0] == '/' */ yylval.operator = DIVISION;
                  return Priority1BinaryOperator;
                }
\(              { return LeftParenthesis; }
\)              { return RightParenthesis; }
\n              { return LineReturn; }

%%

// Yacc specification ['calc.yacc']

%left UnaryOrPriority2BinaryOperator
%left Priority1BinaryOperator

%token IntegerLiteral RealLiteral LeftParenthesis RightParenthesis LineReturn

%start Line

%{
#include <stdio.h>
#define YYSTYPE attributes

typedef enum { PLUSORADDITION, MINUSORSUBSTRACTION, MULTIPLICATION, DIVISION } optype;
typedef struct {
  optype operator;
  double value;
} attributes;
%}
%%

Expression : RealLiteral { $$.value = $1.value; }
           | IntegerLiteral { $$.value = $1.value; }
           | LeftParenthesis Expression RightParenthesis { $$.value = $2.value; }
           | UnaryOrPriority2BinaryOperator Expression
             {
               if ($1.operator == PLUSORADDITION) $$.value = $2.value;
               else /* $1.operator == MINUSORSUBSTRACTION */ $$.value = -$2.value;
             }
           | Expression UnaryOrPriority2BinaryOperator Expression
             {
               if ($2.operator == PLUSORADDITION) $$.value = $1.value+$3.value;
               else /* $2.operator == MINUSORSUBSTRACTION */ $$.value = $1.value-$3.value;
             }
           | Expression Priority1BinaryOperator Expression
             {
               if ($2.operator == MULTIPLICATION) $$.value = $1.value*$3.value;
               else /* $2.operator == DIVISION */ $$.value = $1.value/$3.value;
             }
           ;

Line : Expression LineReturn { printf("%f\n", $1.value); } ;

%%

#include "lex.yy.c"

int main(int argc, char *argv[]) { yyparse(); }
int yyerror(char *s) { printf("%s", s); }

Figure 26 Complete lex & yacc implementation of the arithmetic expression interpreter.

grammar ExpressionEvaluator;

expression1 : expression2 ( '+' expression1 | '-' expression1 )* ;

expression2 : expression3 ( '*' expression2 | '/' expression2 )* ;

expression3 : '(' expression1 ')'
            | '+' expression1
            | '-' expression1
            | INTEGERLITERAL
            | FLOATLITERAL
            ;

INTEGERLITERAL : '0'..'9'+ ;
FLOATLITERAL : ('0'..'9')+ '.' ('0'..'9')* ;
NEWLINE : '\r'? '\n' ;

Figure 27 ANTLR specification of the arithmetic expression language.

grammar ExpressionEvaluator;

expression1 returns [double value]
    : e=expression2 {$value = $e.value;}
      ( '+' e2=expression1 {$value += $e2.value;}
      | '-' e2=expression1 {$value -= $e2.value;}
      )*
    ;

expression2 returns [double value]
    : e=expression3 {$value = $e.value;}
      ( '*' e2=expression2 {$value *= $e2.value;}
      | '/' e2=expression2 {$value /= $e2.value;}
      )*
    ;

expression3 returns [double value]
    : '(' e=expression1 ')' {$value = $e.value;}
    | '+' e=expression1 {$value = $e.value;}
    | '-' e=expression1 {$value = -$e.value;}
    | i=INTEGERLITERAL {$value = (double) Integer.parseInt($i.text);}
    | f=FLOATLITERAL {$value = Double.parseDouble($f.text);}
    ;

INTEGERLITERAL : '0'..'9'+ ;
FLOATLITERAL : ('0'..'9')+ '.' ('0'..'9')* ;
NEWLINE : '\r'? '\n' ;

Figure 28 Complete ANTLR implementation of the arithmetic expression interpreter.

Figure 29 ModelCC specification of the arithmetic expression language.

• When changes in the language specification are necessary,
as is often the case when a software system
is successful, the traditional language designer will
have to propagate changes throughout the entire
language processing tool chain, often introducing
significant changes and making profound restructurings
in the working code base. These changes can
be time-consuming, quite tedious, and extremely
error-prone. In contrast, modifications are more
easily done when a model-driven language specification
approach is followed. Any modifications in
a language will affect either the abstract syntax
model, when new capabilities are incorporated
into a language, or the constraints that define
the ASM-CSM mapping, whenever syntactic details
change or new CSMs are devised for the same
ASM. In either case, the more time-consuming, tedious,
and error-prone modifications are automated
by ModelCC, whereas the language designer can focus
on the essential part of the required
changes.

• Finally, traditional parser generators typically mix
semantic actions with the syntactic details of the
language specification. This approach, which is jus- 
tified when performance is the top concern, might 
lead to poorly-designed hard-to-test systems when 
not done with extreme care. Moreover, when differ- 
ent applications or tools employ the same language, 
any changes to the syntax of that language have to 
be replicated in all the applications and tools that 
use the language. The maintenance of several ver- 
sions of the same language specification in parallel 
might also lead to severe maintenance problems. 
In contrast, the separation of concerns provided by 
ModelCC, as separate ASM and ASM-CSM map- 
pings, promotes a more elegant design for language 
processing systems. By decoupling language spec- 
ification from language processing and providing a 
conceptual model for the language, different appli- 
cations and tools can now use the same language 
without having duplicate language specifications. 
A similar result could be hand-crafted using tradi- 
tional parser generators (i.e. making their implicit 
conceptual model explicit and working on that ex- 
plicit model), but ModelCC automates this part of 
the process. 

In summary, while traditional language processing
tools provide different mechanisms for resolving ambiguities
and implementing language constraints, the solutions
they provide typically interfere with the conceptual
modeling of languages: relatively minor syntactic details
might significantly affect the structure of the whole language
specification. Model-driven language specification,
as exemplified by ModelCC, provides a cleaner separation
of concerns: the abstract syntax model is kept separate
from its incarnation in concrete syntax models, thereby
separating the specification of abstractions in the ASM
from the particularities of their textual representation in
CSMs.

public abstract class Expression implements IModel {
  public abstract double eval();
}

@Prefix("\\(") @Suffix("\\)")
public class ParenthesizedExpression extends Expression implements IModel {
  Expression e;
  @Override public double eval() { return e.eval(); }
}

public abstract class LiteralExpression extends Expression implements IModel {
}

public class UnaryExpression extends Expression implements IModel {
  UnaryOperator op;
  Expression e;
  @Override public double eval() { return op.eval(e); }
}

public class BinaryExpression extends Expression implements IModel {
  Expression e1;
  BinaryOperator op;
  Expression e2;
  @Override public double eval() { return op.eval(e1, e2); }
}

public class IntegerLiteral extends LiteralExpression implements IModel {
  @Value int value;
  @Override public double eval() { return (double) value; }
}

public class RealLiteral extends LiteralExpression implements IModel {
  @Value double value;
  @Override public double eval() { return value; }
}

public abstract class UnaryOperator implements IModel {
  public abstract double eval(Expression e);
}

@Pattern(regExp="\\+")
public class PlusOperator extends UnaryOperator implements IModel {
  @Override public double eval(Expression e) { return e.eval(); }
}

@Pattern(regExp="-")
public class MinusOperator extends UnaryOperator implements IModel {
  @Override public double eval(Expression e) { return -e.eval(); }
}

@Associativity(AssociativityType.LEFT_TO_RIGHT)
public abstract class BinaryOperator implements IModel {
  public abstract double eval(Expression e1, Expression e2);
}

@Priority(value=2) @Pattern(regExp="\\+")
public class AdditionOperator extends BinaryOperator implements IModel {
  @Override public double eval(Expression e1, Expression e2) { return e1.eval()+e2.eval(); }
}

@Priority(value=2) @Pattern(regExp="-")
public class SubstractionOperator extends BinaryOperator implements IModel {
  @Override public double eval(Expression e1, Expression e2) { return e1.eval()-e2.eval(); }
}

@Priority(value=1) @Pattern(regExp="\\*")
public class MultiplicationOperator extends BinaryOperator implements IModel {
  @Override public double eval(Expression e1, Expression e2) { return e1.eval()*e2.eval(); }
}

@Priority(value=1) @Pattern(regExp="\\/")
public class DivisionOperator extends BinaryOperator implements IModel {
  @Override public double eval(Expression e1, Expression e2) { return e1.eval()/e2.eval(); }
}

Figure 30 Complete Java implementation of the arithmetic expression interpreter using ModelCC: A set of Java classes define
the language ASM, metadata annotations specify the desired ASM-CSM mapping, and object methods implement arithmetic
expression evaluation.
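
The separation between the ASM and its textual representation can be illustrated without any parser at all. The Java sketch below mirrors a small part of the model in Figure 30 using plain classes with no ModelCC annotations, and builds by hand the ASM instance that a parser would be expected to produce for the input "3+5*2". All names in the sketch are hypothetical simplifications; in particular, operators are modeled as lambdas instead of annotated classes.

```java
// Hypothetical, annotation-free mirror of part of the calculator ASM.
final class AsmEvaluationSketch {

    interface Expression {
        double eval();
    }

    interface BinaryOperator {
        double eval(Expression e1, Expression e2);
    }

    static final class Literal implements Expression {
        private final double value;

        Literal(double value) { this.value = value; }

        @Override public double eval() { return value; }
    }

    static final class BinaryExpression implements Expression {
        private final Expression e1;
        private final BinaryOperator op;
        private final Expression e2;

        BinaryExpression(Expression e1, BinaryOperator op, Expression e2) {
            this.e1 = e1;
            this.op = op;
            this.e2 = e2;
        }

        @Override public double eval() { return op.eval(e1, e2); }
    }

    static final BinaryOperator ADDITION = (e1, e2) -> e1.eval() + e2.eval();
    static final BinaryOperator MULTIPLICATION = (e1, e2) -> e1.eval() * e2.eval();

    // ASM instance for "3+5*2", assuming * takes priority over +.
    static Expression example() {
        return new BinaryExpression(
            new Literal(3),
            ADDITION,
            new BinaryExpression(new Literal(5), MULTIPLICATION, new Literal(2)));
    }

    public static void main(String[] args) {
        System.out.println(example().eval()); // 13.0
    }
}
```

Nothing in the evaluation logic depends on how the expression was written down, which is why the same model could, in principle, be reused with a different concrete syntax.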



We plan to study the possibilities tools such as ModelCC
open up in different application domains, including
traditional language processing systems (compilers
and interpreters) [3], domain-specific languages (DSLs)
[l9l Hl[ and language workbenches [18], model-driven
software development (MDSD) tools [54][21], natural language
processing [29] in restricted domains, model induction,
text mining applications, data integration, and
information extraction.



VII. CONCLUSIONS AND FUTURE WORK 



In this paper, we have introduced ModelCC, a model- 
based tool for language specification. ModelCC lets lan- 
guage designers create explicit models of the concepts 
a language represents, i.e. the abstract syntax model 
(ASM) of the language. Then, that abstract syntax 
can be represented in textual or graphical form, using 
the concrete syntax defined by a concrete syntax model 
(CSM). ModelCC automates the ASM-CSM mapping by 
means of metadata annotations on the ASM, which let 
ModelCC act as a model-based parser generator. 
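
The idea of attaching the concrete syntax to the abstract syntax through metadata can be illustrated with a minimal, self-contained sketch. It does not use ModelCC and does not reproduce its implementation: the annotation TokenPattern, the classes BinaryOp, Times, and Over, and the method recognize are hypothetical names introduced only for this illustration. The model classes describe what an operator does, the annotation states how it is written, and a generic routine reads the annotation by reflection to instantiate the matching class.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a pattern annotation; not ModelCC's own type.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface TokenPattern { String regExp(); }

// Abstract syntax: what an operator means, with no mention of how it is written.
abstract class BinaryOp {
    abstract double eval(double left, double right);
}

// Concrete syntax is attached as metadata only.
@TokenPattern(regExp = "\\*")
class Times extends BinaryOp {
    @Override double eval(double left, double right) { return left * right; }
}

@TokenPattern(regExp = "\\/")
class Over extends BinaryOp {
    @Override double eval(double left, double right) { return left / right; }
}

class AnnotationMappingSketch {
    static final List<Class<? extends BinaryOp>> OPERATORS = List.of(Times.class, Over.class);

    // Returns an instance of the first model class whose annotated pattern matches the token.
    static BinaryOp recognize(String token) {
        for (Class<? extends BinaryOp> type : OPERATORS) {
            TokenPattern mapping = type.getAnnotation(TokenPattern.class);
            if (mapping != null && Pattern.matches(mapping.regExp(), token)) {
                try {
                    return type.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
                }
            }
        }
        throw new IllegalArgumentException("No operator matches token: " + token);
    }

    public static void main(String[] args) {
        System.out.println(recognize("*").eval(6, 7)); // 42.0
        System.out.println(recognize("/").eval(6, 3)); // 2.0
    }
}
```

In this sketch, changing the textual form of an operator only requires editing the regular expression in its annotation; neither the eval method nor the recognize routine changes.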

ModelCC is not bound to particular scanning and pars- 
ing techniques, so that language designers do not have to 
tweak their models to comply with the constraints im- 
posed by particular parsing algorithms. ModelCC ab- 
stracts away many details traditional language process- 
ing tools have to deal with. It cleanly separates lan- 
guage specification from language processing. Given 
the proper ASM-CSM mapping definition, ModelCC- 
generated parsers are able to automatically instantiate 
the ASM given an input string representing the ASM in 
a concrete syntax. 

Apart from being able to deal with ambiguous lan- 
guages, ModelCC also allows the declarative resolution of 
any language ambiguities by means of constraints defined 
over the ASM. The current version of ModelCC also sup- 
ports lexical ambiguities and custom pattern matching 
classes. A fully functional version of ModelCC is available
at http://www.modelcc.org.
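
As a rough illustration of how declared priorities can settle an ambiguity such as 1 + 2 * 3, the following sketch reduces a flat sequence of operands and operators by always applying the tightest-binding operator first. It is not the algorithm ModelCC uses. The class PrioritySketch and its methods are hypothetical, and the sketch assumes that a smaller priority value binds tighter and that ties are resolved left to right. Only the value 1 for division is taken from the listing; the remaining values are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PrioritySketch {
    // Assumption: a smaller value binds tighter. Only '/' = 1 comes from the listing.
    static final Map<Character, Integer> PRIORITY = Map.of('/', 1, '*', 2, '+', 3, '-', 3);

    static double apply(char op, double left, double right) {
        switch (op) {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            case '/': return left / right;
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    // operands has exactly one more element than operators: a op0 b op1 c ...
    static double evaluate(List<Double> operands, List<Character> operators) {
        List<Double> values = new ArrayList<>(operands);
        List<Character> ops = new ArrayList<>(operators);
        while (!ops.isEmpty()) {
            // Pick the operator with the smallest priority value, leftmost on ties.
            int best = 0;
            for (int i = 1; i < ops.size(); i++) {
                if (PRIORITY.get(ops.get(i)) < PRIORITY.get(ops.get(best))) {
                    best = i;
                }
            }
            // Replace the two operands around that operator with the result.
            double reduced = apply(ops.remove(best), values.get(best), values.get(best + 1));
            values.set(best, reduced);
            values.remove(best + 1);
        }
        return values.get(0);
    }

    public static void main(String[] args) {
        // 1 + 2 * 3: the multiplication is reduced first.
        System.out.println(evaluate(List.of(1.0, 2.0, 3.0), List.of('+', '*'))); // 7.0
        // 8 - 3 - 2: equal priorities are reduced left to right.
        System.out.println(evaluate(List.of(8.0, 3.0, 2.0), List.of('-', '-'))); // 3.0
    }
}
```

The priorities live in one declarative table, so changing how an ambiguity is resolved means changing a value in that table, not restructuring the evaluation code.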



The proposed model-driven language specification ap- 
proach promotes the domain-driven design of language 
processors. Its model-driven philosophy supports language
evolution by improving the maintainability of language
processing systems. It also facilitates the reuse of
language specifications across product lines and differ- 
ent applications, eliminating the duplication required by 
conventional tools and improving the modularity of the 
resulting systems. 

In the future, ModelCC will incorporate a wider variety 
of parsing techniques and it will be able to automatically 
determine the most efficient parsing algorithm that is 
able to parse a particular language (the current version 
employs the Fence parsing algorithm on top of the
Lamb scanning algorithm [51]).

ModelCC will also be extended to support multiple 
concrete syntax models for the same abstract syntax 
model. 